local-cse introduces side effects via local.tee, blocking DCE in O3
Given the following code:
(module
(type $0 (func))
(type $1 (func (param i32 i32) (result i32)))
(import "External" "external_function" (func $external_function (type $0)))
(func $_start (type $1) (param $0 i32) (param $1 i32) (result i32)
(local $2 i32) (local $4 i32) (local $7 i32) (local $8 i32) (local $11 i32) (local $13 i32)
global.get $__stack_pointer
local.set $2
local.get $2
i64.const 0
i64.store offset=1752
i32.const 4058
local.set $4
local.get $2
local.get $4
i32.add
i32.const 1
i32.add
local.set $7
i32.const 4058
local.set $8
local.get $2
local.get $8
i32.add
i32.const 1
i32.add
local.set $11
local.get $7
local.get $11
i32.ne
local.set $13
block ;; label = @1
local.get $13
i32.eqz
br_if 0 (;@1;)
call $external_function
end
unreachable)
(memory $0 259 259)
(global $__stack_pointer (mut i32) (i32.const 0))
(export "_start" (func $_start)))
wasm-opt (862aeb9) eliminates the dead br_if body by -all -O2 but can not do that by -all -O3.
After analysis, O3 performs one more local-cse than O2, thus introducing local.tee (e.g., combining local.set + local.get). However, it also introduces side effect, which blocks constant propagation and then finally blocks DCE
Specifically, through local-cse, wasm-opt transforms the condition (always false) of br_if statements from
(i32.ne
(i32.add
(i32.add
(local.get $0)
(i32.const 4058)
)
(i32.const 1)
)
(i32.add
(i32.add
(local.get $0)
(i32.const 4058)
)
(i32.const 1)
)
)
to
(i32.ne
(local.tee $2
(i32.add
(i32.add
(local.get $0)
(i32.const 4058)
)
(i32.const 1)
)
)
(local.get $2)
)
thus blocking the constant propagation, leading to DCE missed optimization.
Overall, I think it is a missed optimization.
Yes, the issue here is that
(i32.ne
(local.tee $0
(i32.add
(local.get $0)
(i32.const 4059)
)
)
(local.get $0)
)
does not get optimized. The two inputs are equal, so we can optimize it, but the local.tee confuses us.
Looks like that happens in optimizeBinaryWithEqualEffectlessChildren, which should be renamed and made to handle children with effects. We can do that by using areConsecutiveInputsEqual, which handles the tee/get pattern.
A duplicated case is found:
(module
(import "External" "external_function" (func $external_function))
(func $_start (param $2 i32) (param $3 i32)
(local $4 i32)
global.get $__stack_pointer
local.set $2
block ;; label = @1
local.get $4
i32.eqz
br_if 0 (;@1;)
end
block ;; label = @1
i32.const 7920
local.get $2
i32.add
local.get $2
i32.const 7920
i32.add
i32.eq
br_if 0 (;@1;)
call $external_function
end
)
(global $__stack_pointer (mut i32) (i32.const 0))
(export "_start" (func $_start)))
Currently newest wasm-opt version (8c82b68) performs worse by -O3 than by -O2.
O3 results
(if
(i32.ne
(local.tee $0
(i32.const 0)
)
(local.get $0)
)
(then
(call $external_function)
)
)
)
while O2 can deduce it to a nop (remove all things).
We've discussed the root cause: wasm-opt cannot deduce the binary operation with two side effect children. And already submitted a PR with all CI pass, however, it looks wrong with an updated case, so it is deferred to now.
Since this missed optimization is common, we'd better to continue and finish it?
https://github.com/WebAssembly/binaryen/pull/7460#discussion_r2037656490
#7863 helps here, and might help with other similar issues.
The last testcase here now optimizes fully with -O3. The original testcase requires -O3 -O3, so there is still a minor issue there, but just of efficiency.
Sorry for the delay. The refactor of the precompute looks good, and is smarter than adding ad hoc logic as before.