Slow optimizations on Dart binary
There is a dart binary that takes extremely long to optimize with this command:
wasm-opt -g -all --closed-world -tnh --type-unfinalizing -O3 --type-ssa --gufa -O3 --type-merging -O1 --type-finalizing
Most passes run in under a second, except that in the first -O3:
- ssa-nomerge takes 90s
- precompute-propagate takes 104s
- heap2local takes 55s
- local-subtyping takes 56s
- code-folding takes 77s
- dae-optimizing takes 361s
- optimize-stack-ir takes 36s
Overall, the first -O3 takes 836s. After that GUFA takes 31s and the second -O3 takes 410s. The -O1 takes 10s, spending most of its time in two executions of coalesce-locals that take 3s each.
code-folding is odd as it doesn't use LocalGraph, but all the others do and are explained by that.
This module has a function with over 100 K locals and over 300 K local.gets, so LocalGraph has a lot of work to do.
Possible optimization ideas:
- Running
-O1first removes all those locals, but leaves the param and thelocal.gets, and it is still slow. - Noting there are no
local.sets for the param lets us skip all the block flowing. - Even with that, however, we do end up doing over 300 K inserts into an
unordered_map<Index, ..>which is apparently quite slow.
Can we replace the unordered_map with index keys with a simple array / vector?
Not easily, as we'd need to build an indexing for the LocalGet* etc. pointers.
But, meanwhile it seems the larger issue by far is that we can just skip param flowing entirely, as done in #6046.
A remaining issue is that the sheer number of locals in the function makes us slow nonetheless, but running -O1 first solves that. In theory perhaps -O3 should run those passes before the first LocalGraph-using pass, but that would add significant work that usually won't help, I worry.