binaryen icon indicating copy to clipboard operation
binaryen copied to clipboard

Running `wasm-opt` with `-Os` leaves NOP / dead control flow behind

Open mkustermann opened this issue 2 months ago • 6 comments

See attached flute.tar.gz

We start with an unoptimized flute.wasm file and run wasm-opt

% wasm-opt -g --enable-gc --enable-reference-types --enable-multivalue --enable-exception-handling --enable-nontrapping-float-to-int --enable-sign-ext --enable-bulk-memory --enable-threads '--no-inline=*<noInline>*' --traps-never-happen \
    -Os \
    flute.wasm \
    -o flute.Os.wasm

This -Os leaves NOP blocks behind:

 (func $_InterfaceType.toString ...
  ...
    block $label0 (result i32)
      local.get $var0
      ref.cast $_InterfaceType
      local.tee $var7
      struct.get $_InterfaceType $classId
      br $label0
    end $label0
  ...
)

It also doesn't prune branches with unreachable (some variants of this):

(func $_FunctionType.toString
  ...
            if (result (ref $JSStringImpl))
              local.get $var6
              unreachable
            else
              local.get $var6
              block $label5 (result (ref $JSStringImpl))
                local.get $var9
                i32.const 0
                global.get $"C8 DiagnosticLevel"
                ref.null none
                local.get $var9
                struct.get $_Type $field0
                i32.const 3066
                i32.add
                call_indirect (param (ref $#Top) i32 (ref $DiagnosticLevel) (ref null $TextTreeConfiguration)) (result (ref $JSStringImpl))
                br $label5
              end $label5
              call $JSStringImpl.+
            end
  ...  
)

If we run another -Os pass then those issues disappear

% wasm-opt -g --enable-gc --enable-reference-types --enable-multivalue --enable-exception-handling --enable-nontrapping-float-to-int --enable-sign-ext --enable-bulk-memory --enable-threads '--no-inline=*<noInline>*' --traps-never-happen \
    -Os \
    flute.Os.wasm \
    -o flute.Os.Os.wasm

From a users point of view, I wouldn't expect an -Os to leave these code patterns behind. We can of course run multiple -Os rounds, but that a) may do more work than needed and b) raises the question of how many -Os passes one may need to run to cleanup things left by previous -Os passes.

mkustermann avatar Dec 01 '25 08:12 mkustermann

This is a current limitation. Most passes will proceed while they see more work to do, but the pass pipeline as a whole does not keep running while one pass opens up work for another. Investigating this case, that is what happens here.

We do have an optional flag --converge, that will keep running all the passes while code size decreases, but it can have unpredictable compile times (and code size isn't always the right metric).

In the long term, we may want a smarter pass manager that sees which passes open up work for others and keeps running them, but there are no immediate plans for that. Running multiple -Os etc. is the simplest workaround for now.

If this is a significant problem, perhaps improving --converge is a short-term option we can consider, but I'd hope this is only a matter of a few bytes here and there?

kripken avatar Dec 01 '25 17:12 kripken

What are the passes involved in the cycle of creating new opportunities here? Perhaps we could do another targeted fix by combining passes or having them run each other as part of their internal iterations.

tlively avatar Dec 01 '25 18:12 tlively

Here, removing the useless blocks only requires --remove-unused-brs --vacuum. But the reason those are the right passes is just a coincidence - there was simply so much work after inlining that it stopped at some "random" point. It could have been some entirely other set of passes needed there, in other words, and it might require cycles of them.

kripken avatar Dec 01 '25 20:12 kripken

Re-measuring this now, I see a 1% code size improvement with adding another -Os after the first, and only 0.03% when just adding the two passes mentioned above that fix the obvious clutter. That is, I think that clutter is much less than 1% - does that sound right @mkustermann or did you measure things differently?

kripken avatar Dec 09 '25 22:12 kripken

@kripken Yes, using -Os --remove-unused-brs --vacuum isn't giving much more than -Os. Yes, using -Os -Os is giving 1% more than -Os.

Though running with -Os --remove-unused-brs --vacuum doesn't seem to actually eliminate all such NOP branches and blocks

% time wasm-opt -g --enable-gc --enable-reference-types --enable-multivalue --enable-exception-handling --enable-nontrapping-float-to-int --enable-sign-ext --enable-bulk-memory --enable-threads '--no-inline=*<noInline>*' --traps-never-happen \
    -Os --remove-unused-brs --vacuum \
    flute.wasm \
    -o flute.Os.removeUnusedBrs.vacuum.wasm

I still see the unused branches, e.g. in

 (func $_InterfaceType.toString (param $var0 (ref $#Top)) (param $var1 i32) (param $var2 (ref $DiagnosticLevel)) (param $var3 (ref null $TextTreeConfiguration)) (result (ref $JSStringImpl))
    (local $var4 (ref $JSStringImpl))
    (local $var5 (ref $JSStringImpl))
    (local $var6 (ref null $JSStringImpl))
    (local $var7 (ref null $_InterfaceType))
    (local $var8 (ref $_Type))
    (local $var9 (ref null $Array<String>))
    (local $var10 (ref $JSStringImpl))
    (local $var11 i64)
    block $label0 (result i32)
      local.get $var0
      ref.cast $_InterfaceType
      local.tee $var7
      struct.get $_InterfaceType $classId
      br $label0
    end $label0
  ...
)
(func $stringCombineHashes (param $var0 i64) (param $var1 i64) (result i64)
    block $label0 (result i64)
      local.get $var0
      local.get $var1
      i64.add
      local.tee $var0
      block $label1 (result i64)
        local.get $var0
        i64.const 10
        i64.shl
        br $label1
      end $label1
      i64.add
      local.tee $var0
      i64.const 4294967295
      i64.and
      i64.const 6
      i64.shr_u
      br $label0
    end $label0
    local.get $var0
    i64.xor
  )

@kripken Do you see the same? I.e. running with -Os --remove-unused-brs --vacuum doesn't remove those NOP blocks & branches?

mkustermann avatar Dec 10 '25 13:12 mkustermann

Yes, you are right, it looks like some of those situations need a little more, as before the binary lowering a drop exists which is dead code, but interferes with those other two passes. Adding --dce fixes both of those for me (for a total of --dce --remove-unused-brs --vacuum).

Sorry, I may have been testing with that when I measured before, and gotten it mixed up. But the code size measurements remain about the same after adding that pass.

kripken avatar Dec 10 '25 17:12 kripken

Thanks, @kripken . Yes I can confirm that using --dce --remove-unused-brs --vaccum gets rid of the NOP blocks and removes around -0.085 % (whereas a second -Os would remove 0.67 %) - so the NOP block are only ~ 12% of savings.

We may just run a second -Os - depending on how much compile time it adds.

mkustermann avatar Dec 15 '25 12:12 mkustermann