local-cse in O3 introduces side-effecting across structurally isolated sub-trees thus prevent const prop & DCE
Given the following code:
(module
(import "External" "external_function" (func $external_function))
(func $_start (param $0 i64) (param $1 i32) (param $2 i32) (param $3 i32)
(local $4 i32) (local $5 i32)
i32.const 0
call $foo
drop
)
(func $foo (param $0 i32) (result i32)
(local $1 i32) (local $2 i32) (local $4 i32) (local $5 i32) (local $scratch i32) (local $3 i64) (local $6 i64)
i32.const 77986
i32.load16_u
i32.const 16
local.tee $2
i32.shl
local.get $2
i32.shr_s
i32.const 65533
i32.eq
i32.const 77986
i32.load16_u
i32.const 16
i32.shl
i32.and
if (result i32) ;; label = @4
call $external_function
i32.const 0
else
i32.const 1
end
)
(memory $0 258 258)
(export "_start" (func $_start)))
wasm-opt (d0d970cb5) eliminates the dead br_if body by -all -O2 but can not do that by -all -O3.
Analysis
Similar but different to #7440, this time wasm-opt in O3 introduces side-effect in more complex structure by local-cse:
The figure below depicts how the local.cse transforms the input:
It introduces the local.tee for less code size, however, it also affects the constant propagation and further blocks the dead code elimination.
Different from #7440, whose solution is simply enhancing the optimization of the binaryOp regardless that the side effects of its children, this issues is more complex because the local-cse does CSE in different depth and sub-tree, is there any code logic which can check or deal with this issue?
We do track the bits of locals, but I guess that isn't enough here - we need to see that the shl and shr_s combine to form a 16-bit sign-extend (after that, comparison to 65533 is always false, as the sign bit of a 16-bit number can't be set without the higher bits as well).
We could in theory track "partial sign-extends". Or, more generally, we could "look through" locals to their source, maybe in some optimize-instructions-propagate that would parallel precompute-propagate (in that it looks through locals, using a LocalGraph; perhaps we could do it concretely by expanding "getFallthrough", that is, we'd look not just at fallthrough values in the expression itself, but that arrive via locals). Another option could be to "expand" local-cse results like this, knowing we can re-optimize them later if we fail to find a pattern for them.
None of those options is simple, and I tend to think the benefit is low, so I'd say this is low priority.
Understood, and I agree. It’s might be better to leave this for later.