Running `wasm-opt` with `-Os` leaves NOP / dead control flow behind
See attached flute.tar.gz
We start with an unoptimized flute.wasm file and run wasm-opt
% wasm-opt -g --enable-gc --enable-reference-types --enable-multivalue --enable-exception-handling --enable-nontrapping-float-to-int --enable-sign-ext --enable-bulk-memory --enable-threads '--no-inline=*<noInline>*' --traps-never-happen \
-Os \
flute.wasm \
-o flute.Os.wasm
This -Os leaves NOP blocks behind:
(func $_InterfaceType.toString ...
...
block $label0 (result i32)
local.get $var0
ref.cast $_InterfaceType
local.tee $var7
struct.get $_InterfaceType $classId
br $label0
end $label0
...
)
It also doesn't prune branches with unreachable (some variants of this):
(func $_FunctionType.toString
...
if (result (ref $JSStringImpl))
local.get $var6
unreachable
else
local.get $var6
block $label5 (result (ref $JSStringImpl))
local.get $var9
i32.const 0
global.get $"C8 DiagnosticLevel"
ref.null none
local.get $var9
struct.get $_Type $field0
i32.const 3066
i32.add
call_indirect (param (ref $#Top) i32 (ref $DiagnosticLevel) (ref null $TextTreeConfiguration)) (result (ref $JSStringImpl))
br $label5
end $label5
call $JSStringImpl.+
end
...
)
If we run another -Os pass then those issues disappear
% wasm-opt -g --enable-gc --enable-reference-types --enable-multivalue --enable-exception-handling --enable-nontrapping-float-to-int --enable-sign-ext --enable-bulk-memory --enable-threads '--no-inline=*<noInline>*' --traps-never-happen \
-Os \
flute.Os.wasm \
-o flute.Os.Os.wasm
From a users point of view, I wouldn't expect an -Os to leave these code patterns behind. We can of course run multiple -Os rounds, but that a) may do more work than needed and b) raises the question of how many -Os passes one may need to run to cleanup things left by previous -Os passes.
This is a current limitation. Most passes will proceed while they see more work to do, but the pass pipeline as a whole does not keep running while one pass opens up work for another. Investigating this case, that is what happens here.
We do have an optional flag --converge, that will keep running all the passes while code size decreases, but it can have unpredictable compile times (and code size isn't always the right metric).
In the long term, we may want a smarter pass manager that sees which passes open up work for others and keeps running them, but there are no immediate plans for that. Running multiple -Os etc. is the simplest workaround for now.
If this is a significant problem, perhaps improving --converge is a short-term option we can consider, but I'd hope this is only a matter of a few bytes here and there?
What are the passes involved in the cycle of creating new opportunities here? Perhaps we could do another targeted fix by combining passes or having them run each other as part of their internal iterations.
Here, removing the useless blocks only requires --remove-unused-brs --vacuum. But the reason those are the right passes is just a coincidence - there was simply so much work after inlining that it stopped at some "random" point. It could have been some entirely other set of passes needed there, in other words, and it might require cycles of them.
Re-measuring this now, I see a 1% code size improvement with adding another -Os after the first, and only 0.03% when just adding the two passes mentioned above that fix the obvious clutter. That is, I think that clutter is much less than 1% - does that sound right @mkustermann or did you measure things differently?
@kripken Yes, using -Os --remove-unused-brs --vacuum isn't giving much more than -Os. Yes, using -Os -Os is giving 1% more than -Os.
Though running with -Os --remove-unused-brs --vacuum doesn't seem to actually eliminate all such NOP branches and blocks
% time wasm-opt -g --enable-gc --enable-reference-types --enable-multivalue --enable-exception-handling --enable-nontrapping-float-to-int --enable-sign-ext --enable-bulk-memory --enable-threads '--no-inline=*<noInline>*' --traps-never-happen \
-Os --remove-unused-brs --vacuum \
flute.wasm \
-o flute.Os.removeUnusedBrs.vacuum.wasm
I still see the unused branches, e.g. in
(func $_InterfaceType.toString (param $var0 (ref $#Top)) (param $var1 i32) (param $var2 (ref $DiagnosticLevel)) (param $var3 (ref null $TextTreeConfiguration)) (result (ref $JSStringImpl))
(local $var4 (ref $JSStringImpl))
(local $var5 (ref $JSStringImpl))
(local $var6 (ref null $JSStringImpl))
(local $var7 (ref null $_InterfaceType))
(local $var8 (ref $_Type))
(local $var9 (ref null $Array<String>))
(local $var10 (ref $JSStringImpl))
(local $var11 i64)
block $label0 (result i32)
local.get $var0
ref.cast $_InterfaceType
local.tee $var7
struct.get $_InterfaceType $classId
br $label0
end $label0
...
)
(func $stringCombineHashes (param $var0 i64) (param $var1 i64) (result i64)
block $label0 (result i64)
local.get $var0
local.get $var1
i64.add
local.tee $var0
block $label1 (result i64)
local.get $var0
i64.const 10
i64.shl
br $label1
end $label1
i64.add
local.tee $var0
i64.const 4294967295
i64.and
i64.const 6
i64.shr_u
br $label0
end $label0
local.get $var0
i64.xor
)
@kripken Do you see the same? I.e. running with -Os --remove-unused-brs --vacuum doesn't remove those NOP blocks & branches?
Yes, you are right, it looks like some of those situations need a little more, as before the binary lowering a drop exists which is dead code, but interferes with those other two passes. Adding --dce fixes both of those for me (for a total of --dce --remove-unused-brs --vacuum).
Sorry, I may have been testing with that when I measured before, and gotten it mixed up. But the code size measurements remain about the same after adding that pass.
Thanks, @kripken . Yes I can confirm that using --dce --remove-unused-brs --vaccum gets rid of the NOP blocks and removes around -0.085 % (whereas a second -Os would remove 0.67 %) - so the NOP block are only ~ 12% of savings.
We may just run a second -Os - depending on how much compile time it adds.