O3's late condition deduction and missing DCE prevent dead br_if elimination
Given the following code:
(module
(import "External" "external_function" (func $external_function))
(func $_start (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32)
call $foo
local.set 1
i32.const 1
local.set 2
local.get 1
local.get 2
i32.and
local.set 3
i32.const -556321873
local.set 4
local.get 3
local.get 4
i32.ge_u
local.set 5
i32.const 0
i32.load
local.set 6
local.get 5
local.get 6
i32.and
local.set 7
i32.const 0
local.set 0
block ;; label = @1
local.get 7
i32.eqz
br_if 0 (;@1;)
call $external_function
i32.const 0
local.set 0
end
local.get 0
local.set 8
i32.const 0
local.get 8
i32.store
unreachable)
(func $foo (result i32)
i32.const 0
i32.const 0
i32.store8
i32.const 0)
(memory (;0;) 258 258)
(export "_start" (func $_start)))
wasm-opt (16dbac1) can eliminate the dead br_if body by -all -O2 but cannot by -all -O3.
Analysis
The most probable issue lies in the optimize-instructions pass:
- In O3,
optimize-instructionsfails to deduce the condition to zero:- While in O2, it successfully deduces the condition to zero.
- Additional DCE is performed in the
updateAfterInliningofinlining-optimizingin O2.
- In O3, the condition is eventually reduced to zero after
inlining-optimizing, but no further DCE (likevacuum) is performed.- This makes the optimization too late compared to O2.
In short, there appears to be a missed optimization in optimize-instructions for O3.
I'm not sure this is an issue in OptimizeInstructions, or at least I can't see what to change there to make this better?
On the other hand, adding --merge-blocks --vacuum after -O3 fixes it. Some structures just take more optimization cycles. Specifically merging blocks moves stuff out of the if condition, so it is easily optimized.
If we had a "keep optimizing til fixed point" mode then this would be optimized fully, but we run a fixed pipeline for simplicity, which works most of the time, and avoids unpredictable compilation times.
I understand what you mean. It’s probably a case where wasm-opt doesn't handle this well due to the ordering and interaction of passes. It's tricky to fully grasp, and as you said, some optimizations might just need more cycles to reach an "optimal state" properly.
My question about is the ordering and interaction:
I've see the wasm-opt use merge-blocks and vacuum many times in function scope (in addDefaultFunctionOptimizationPasses). and not used anytime in addDefaultGlobalOptimizationPostPasses. Why couldn't add the two in the global scope, since after inlining the wasm-opt still invokes the local optimization OptimizeInstructions? Is it because of other considerations?
"global" passes operate on the entire module, while "function" passes operate on each function independently. "global" ones are used for things that need to look at multiple functions at a time, like inlining.
In theory if there was a benefit to running merge-blocks between global passes, we would do it. In fact, some global passes run function passes internally, like inlining-optimizing. But in the testcase here, the situation is pretty rare that we need it after the global opts, and I'm not sure the tradeoff is worth it.
In general, we want -O3 to be good for most use cases, but it is fine if sometimes the user must run -O3 -O3 (run it twice) to get the full benefits. And we don't want to make -O3 slower and slower, because that would slow down the common case. That's what I mean by tradeoffs here.
. And we don't want to make -O3 slower and slower, because that would slow down the common case. That's what I mean by tradeoffs here.
Yeah, I understand what you mean, especially the tradeoffs here. I think it is of low priority and we might be better to leave this for later.
Close because silently fixed in latest version (b89d65a).