binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

O3's late condition deduction and missing DCE prevent dead br_if elimination

Open xuruiyang2002 opened this issue 10 months ago • 5 comments

Given the following code:

(module
  (import "External" "external_function" (func $external_function))
  (func $_start (result i32)
    (local i32 i32 i32 i32 i32 i32 i32 i32 i32)
    call $foo
    local.set 1
    i32.const 1
    local.set 2
    local.get 1
    local.get 2
    i32.and
    local.set 3
    i32.const -556321873
    local.set 4
    local.get 3
    local.get 4
    i32.ge_u
    local.set 5
    i32.const 0
    i32.load
    local.set 6
    local.get 5
    local.get 6
    i32.and
    local.set 7
    i32.const 0
    local.set 0
    block  ;; label = @1
      local.get 7
      i32.eqz
      br_if 0 (;@1;)
      call $external_function
      i32.const 0
      local.set 0
    end
    local.get 0
    local.set 8
    i32.const 0
    local.get 8
    i32.store
    unreachable)
  (func $foo (result i32)
    i32.const 0
    i32.const 0
    i32.store8
    i32.const 0)
  (memory (;0;) 258 258)
  (export "_start" (func $_start)))

wasm-opt (16dbac1) can eliminate the dead br_if body by -all -O2 but cannot by -all -O3.

Analysis

The most probable issue lies in the optimize-instructions pass:

  • In O3, optimize-instructions fails to deduce the condition to zero:
    • While in O2, it successfully deduces the condition to zero.
    • Additional DCE is performed in the updateAfterInlining of inlining-optimizing in O2.
  • In O3, the condition is eventually reduced to zero after inlining-optimizing, but no further DCE (like vacuum) is performed.
    • This makes the optimization too late compared to O2.

In short, there appears to be a missed optimization in optimize-instructions for O3.

xuruiyang2002 avatar Apr 05 '25 07:04 xuruiyang2002

I'm not sure this is an issue in OptimizeInstructions, or at least I can't see what to change there to make this better?

On the other hand, adding --merge-blocks --vacuum after -O3 fixes it. Some structures just take more optimization cycles. Specifically merging blocks moves stuff out of the if condition, so it is easily optimized.

If we had a "keep optimizing til fixed point" mode then this would be optimized fully, but we run a fixed pipeline for simplicity, which works most of the time, and avoids unpredictable compilation times.

kripken avatar Apr 07 '25 20:04 kripken

I understand what you mean. It’s probably a case where wasm-opt doesn't handle this well due to the ordering and interaction of passes. It's tricky to fully grasp, and as you said, some optimizations might just need more cycles to reach an "optimal state" properly.

xuruiyang2002 avatar Apr 08 '25 09:04 xuruiyang2002

My question about is the ordering and interaction:

I've see the wasm-opt use merge-blocks and vacuum many times in function scope (in addDefaultFunctionOptimizationPasses). and not used anytime in addDefaultGlobalOptimizationPostPasses. Why couldn't add the two in the global scope, since after inlining the wasm-opt still invokes the local optimization OptimizeInstructions? Is it because of other considerations?

xuruiyang2002 avatar Apr 09 '25 09:04 xuruiyang2002

"global" passes operate on the entire module, while "function" passes operate on each function independently. "global" ones are used for things that need to look at multiple functions at a time, like inlining.

In theory if there was a benefit to running merge-blocks between global passes, we would do it. In fact, some global passes run function passes internally, like inlining-optimizing. But in the testcase here, the situation is pretty rare that we need it after the global opts, and I'm not sure the tradeoff is worth it.

In general, we want -O3 to be good for most use cases, but it is fine if sometimes the user must run -O3 -O3 (run it twice) to get the full benefits. And we don't want to make -O3 slower and slower, because that would slow down the common case. That's what I mean by tradeoffs here.

kripken avatar Apr 09 '25 20:04 kripken

. And we don't want to make -O3 slower and slower, because that would slow down the common case. That's what I mean by tradeoffs here.

Yeah, I understand what you mean, especially the tradeoffs here. I think it is of low priority and we might be better to leave this for later.

xuruiyang2002 avatar Apr 10 '25 07:04 xuruiyang2002

Close because silently fixed in latest version (b89d65a).

xuruiyang2002 avatar Jun 28 '25 12:06 xuruiyang2002