[Emitting Zero] [Local Tracking] General rule might not handle unary ne emitting zero bits well
Given the following code:
(module
(import "External" "external_function" (func $external_function))
(func $foo (result i32)
i32.const 0
i32.load)
(func $_start (param $0 i32) (param $1 i64) (param $2 i32)
(local $3 i32) (local $4 i32) (local $5 i32) (local $6 i32) (local $7 i32) (local $9 i32)
call $foo
local.set $4
i32.const 1
local.set $3
local.get $4
local.get $3
i32.shl
i32.const 1
i32.shr_s
local.set $6
i32.const 0
i32.const 0
i32.store
i32.const -259031342
local.set $7
local.get $6
local.get $7
i32.ne
local.set $9
block ;; label = @1
local.get $9
i32.eqz
br_if 0 (;@1;)
unreachable
end
call $bar)
(func $bar call $external_function)
(memory $0 258 258)
(export "_start" (func $_start)))
wasm-opt (755a8d0e) should deduce the condition to false, thus fold the branch to unreachable (further deleting call $bar), which works under -O2 but fails under -O3.
Below is optimized by -all -O3:
(func $_start (type $1) (param $0 i32) (param $1 i64) (param $2 i32)
(local.set $0
(i32.shr_s
(i32.shl
(i32.load
(i32.const 0)
)
(i32.const 1)
)
(i32.const 1)
)
)
(i32.store
(i32.const 0)
(i32.const 0)
)
(if
(i32.ne
(local.get $0)
(i32.const -259031342)
)
(then
(unreachable)
)
)
(call $external_function)
)
As you can see, the condition here
(i32.ne
(local.get $0)
(i32.const -259031342)
)
is not deduced to true.
Similar to #7492, Maybe there a missing rule for it to emit zero bits, or is the local tracking insufficient?
Yes, limitations of local tracking hit us here: we don't see all the contents in the local, just part of the pattern we are looking for.
I think, however, that a good fix for this would actually be in SimplifyLocals. That will normally move the local into the use:
(local.set $0
(i32.shr_s
(i32.shl
(i32.load
(i32.const 0)
)
(i32.const 1)
)
(i32.const 1)
)
)
(if
(i32.ne
(local.get $0)
(i32.const -259031342)
)
(then
(unreachable)
)
)
=>
(if
(i32.ne
(i32.shr_s
(i32.shl
(i32.load
(i32.const 0)
)
(i32.const 1)
)
(i32.const 1)
)
(i32.const -259031342)
)
(then
(unreachable)
)
)
(and remove the local.set). The problem here is the inner load, which has effects - that pass sees effects on the local.set's value, and gives up. We could do one of two things:
- Make it find the part without effects, and move only that.
- Add a pass that un-optimizes nested effects like that. That is,
(local.set $0
(i32.eqz
(i32.load)
)
)
=>
(local.set $inner
(i32.load)
)
(local.set $0
(i32.eqz
(local.get $inner)
)
)
This is the opposite of what SimplifyLocals does, but it would solve a whole range of these issues, I think.
That was a clear explanation; I understand now.
The i32.load (with side-effect) in the calculation for $0 prevents the SimplifyLocals from inlining calculation into the if condition. Without such inlining, Precompute-propagation doesn't see the full picture, thus fails to do further optimizations.
Sounds the first solution is limited, second is more general. Add a new pass which could find operations with side-effect and isolate them with temporary variable, thus SimplifyLocals could do inlining as normal.
But how much do these solutions cost? Is it good enough to fix #7492?
It wouldn't fix that particular issue - the problem there wasn't an inner expression with effects like the load in the last example.
But it would fix this, and iirc a few others. This is a general situation we don't yet handle well enough. Flatten was, in theory, the solution to this, but it does a lot more:
https://github.com/WebAssembly/binaryen/blob/e185ff94852efdab9d6d75fa542a7b5b5c785d27/src/ir/flat.h#L45-L55
We need 1 and 3 from that list, but not 2 and 4. Perhaps a variant of the Flatten pass that only does them is an option here, "minimally flatten" or "flatten expressions (but not control flow)".