design Is a Wasm engine allowed to reorder loads?

Please consider only "normal", non-volatile and non-atomic loads in this question / argument.

In principle a Wasm engine could generate more efficient code by allowing load-load reordering. This is normally possible on native architectures since the reordering of loads does not have observable effects.

In Wasm this is almost true as well, if not for the fact that an out-of-bound access will trap which is a visible side-effect.

This seems to suggest that load-load reordering is not allowed, but it also seems like a bad idea to prevent engines from generating better code only due to the highly unlikely case of out-of-bounds accesses.

In practical terms, it seems to me that V8 at this time considers Wasm loads like stores and calls for the purpose of instruction merging and reordering. I'd like to understand if this behavior is required by the standard or not.

A strictly related question is: is an engine allowed to eliminate a never used load instruction?

Mar 14 '21 17:03 alexp-sssup

I believe that's a true limitation. Reordering loads can alter observable stack traces in the case of a trap, which engines should not do. (Unless they can do so in an unobservable way - but that sounds quite hard in general.)

Likewise removing (drop (load ..)) would be disallowed because it could remove a trap.

However, toolchains can do such things. For example the binaryen optimizer will reorder traps, allowing reordering of loads (and a load with something else that might trap like an integer division, etc.). We've thought to add an option to avoid that if the need arose, but it hasn't.

Binaryen also has an option that helps with (drop (load ..)) - "ignore implicit traps" will assume (unreachable) can trap but nothing else will, which means it can assume loads have no side effects, and are removable if not used.

Overall I think this is the intent of the design of wasm, to leave most optimization work at the toolchain level, and have the VMs be as simple as possible.

Mar 14 '21 17:03 kripken

The reality is that toolchains cannot express all the required optimizations though. Consider the following code sequence:

get.local X
i32.load
i32.const Y
i32.eq

Ideally an engine could compile this to a single instruction using a memory operand, for example

cmp [mem],0xY

In this specific case there is only 1 load, so there is no reordering and the engine could still achieve this efficient codegen, but the constraint about reordering loads makes it more complex in practice. It seems to me that this limitation makes the VM simpler only if lower quality codegen is accepted.

Mar 14 '21 18:03 alexp-sssup

Hypothetically, an engine could reorder pairs of loads with no intervening side-effects, so long as it gives up on the ability to always precisely report which load instruction/line number is responsible for an OOB load trap. This is allowed by the core spec, but engines may be choosing not to relax their error-reporting in this way.

It's hard to come up with a weaker spec which could remain deterministic, which people have historically considered important, even for trap cases.

EDIT: with regards to the follow-on in the OP, I believe such dead loads can't be eliminated unless they're statically inferred to be in-bounds. However toolchains can handle these with less difficulty than the example above.

Mar 14 '21 18:03 conrad-watt

It's worth noting that in the given example, engines are already optimizing sequences of load / compare / branch into single instructions in their optimizing tiers, by virtue of going to an IR and then doing instruction selection.

You are right to point out that load-load reordering can happen modulo traps. Normally, out-of-order CPUs do plenty of dynamic memory reordering due to their built-in disambiguation, and they will work hard to make this unobservable if either load generates a fault. CPUs cannot hoist loads out of loops, however. In order for engines to do this, they either need to prove that the access is in bounds or they need to do a lot of work in the case of an actual fault. Then there's the issue of the memory model, of course.

Generally I think load-load reordering is either better done in the producer, which probably has more accurate information about aliasing and the possibility of traps.

Mar 17 '21 16:03 titzer

After some additional thinking on the matter, we realized that loads which can be proven to be fully contained into the "initial" size for the memory of a Wasm module can never fault, and hence can be fully reordered by any engine. A prototype implementation for this idea has been proposed to V8: https://bugs.chromium.org/p/v8/issues/detail?id=11802

May 21 '21 07:05 alexp-sssup

The question here seems answered; please file new issues if there are further questions!

Feb 23 '24 16:02 sunfishcode