
wasm-opt performance

Open arsnyder16 opened this issue 3 years ago • 16 comments

I am working on what I would consider a rather large C++ codebase and compiling it for WebAssembly.

One thing I have noticed during the process is the performance of wasm-opt on my .wasm file. I am wondering if there is anything I can do to help profile where the bottlenecks are in wasm-opt, and whether we can address them.

Here is a breakdown of file size and run time of wasm-opt:

baseline perf.wasm  419.88 MB
O0                  38.6 MB
O1                  33.6 MB
O2                  29.3 MB
O3                  30.7 MB
Os                  28.7 MB  <- our current build optimization level
Oz                  28.3 MB

I am using the following script
ls -l perf.wasm
for OPT in 0 1 2 3 s z
do
  time /root/emsdk/upstream/bin/wasm-opt -O$OPT --strip-dwarf --post-emscripten --low-memory-unused --zero-filled-memory --strip-debug --strip-producers \
    perf.wasm -o perfO$OPT.wasm --mvp-features --enable-threads --enable-mutable-globals --enable-bulk-memory --enable-sign-ext
  ls -l perfO$OPT.wasm
  sleep 15s
done

Here is the output

-rw-r--r-- 1 root root 440277504 Apr 27 08:52 perf.wasm

real    0m12.978s
user    0m17.046s
sys     0m1.136s
-rw-r--r-- 1 root root 40548421 Apr 27 09:21 perfO0.wasm

real    0m28.735s
user    1m47.963s
sys     0m1.555s
-rw-r--r-- 1 root root 35313836 Apr 27 09:22 perfO1.wasm

real    1m44.731s
user    7m1.825s
sys     0m1.861s
-rw-r--r-- 1 root root 30765732 Apr 27 09:24 perfO2.wasm

real    2m15.277s
user    11m46.021s
sys     0m1.842s
-rw-r--r-- 1 root root 32263192 Apr 27 09:26 perfO3.wasm

real    2m2.222s
user    9m3.411s
sys     0m1.860s
-rw-r--r-- 1 root root 30136226 Apr 27 09:29 perfOs.wasm

real    1m59.970s
user    9m23.343s
sys     0m2.571s
-rw-r--r-- 1 root root 29763508 Apr 27 09:31 perfOz.wasm

arsnyder16 avatar Apr 27 '22 14:04 arsnyder16

That would be great!

For profiling, some existing work is in https://github.com/WebAssembly/binaryen/issues/4165 . I added a comment there now to mention how the measurements there were gathered - there is a simple way to get profiling data for each pass, which narrows things down to the pass level at least. Aside from that, using a system profiler like perf can be very useful (running either on all the passes, or on an individual pass).

kripken avatar Apr 27 '22 15:04 kripken

Great! Thanks, here is my -Os output:

[PassRunner]   running pass: duplicate-function-elimination...     14.2522 seconds.
[PassRunner]   running pass: memory-packing...                     0.423822 seconds.
[PassRunner]   running pass: once-reduction...                     0.322653 seconds.
[PassRunner]   running pass: ssa-nomerge...                        8.5075 seconds.
[PassRunner]   running pass: dce...                                2.57108 seconds.
[PassRunner]   running pass: remove-unused-names...                0.444138 seconds.
[PassRunner]   running pass: remove-unused-brs...                  2.37229 seconds.
[PassRunner]   running pass: remove-unused-names...                0.427256 seconds.
[PassRunner]   running pass: optimize-instructions...              1.80523 seconds.
[PassRunner]   running pass: pick-load-signs...                    0.601157 seconds.
[PassRunner]   running pass: precompute...                         1.1791 seconds.
[PassRunner]   running pass: optimize-added-constants-propagate... 15.1376 seconds.
[PassRunner]   running pass: code-pushing...                       0.81729 seconds.
[PassRunner]   running pass: simplify-locals-nostructure...        4.17784 seconds.
[PassRunner]   running pass: vacuum...                             3.45842 seconds.
[PassRunner]   running pass: reorder-locals...                     0.595092 seconds.
[PassRunner]   running pass: remove-unused-brs...                  1.62962 seconds.
[PassRunner]   running pass: coalesce-locals...                    3.61735 seconds.
[PassRunner]   running pass: local-cse...                          1.98886 seconds.
[PassRunner]   running pass: simplify-locals...                    4.76807 seconds.
[PassRunner]   running pass: vacuum...                             3.14018 seconds.
[PassRunner]   running pass: reorder-locals...                     0.548554 seconds.
[PassRunner]   running pass: coalesce-locals...                    3.39823 seconds.
[PassRunner]   running pass: reorder-locals...                     0.610493 seconds.
[PassRunner]   running pass: vacuum...                             3.13448 seconds.
[PassRunner]   running pass: code-folding...                       1.47137 seconds.
[PassRunner]   running pass: merge-blocks...                       1.02699 seconds.
[PassRunner]   running pass: remove-unused-brs...                  1.99127 seconds.
[PassRunner]   running pass: remove-unused-names...                0.552183 seconds.
[PassRunner]   running pass: merge-blocks...                       1.1115 seconds.
[PassRunner]   running pass: precompute...                         1.17952 seconds.
[PassRunner]   running pass: optimize-instructions...              1.56329 seconds.
[PassRunner]   running pass: rse...                                3.74092 seconds.
[PassRunner]   running pass: vacuum...                             3.0541 seconds.
[PassRunner]   running pass: dae-optimizing...                     59.6788 seconds.
[PassRunner]   running pass: inlining-optimizing...                20.4154 seconds.
[PassRunner]   running pass: duplicate-function-elimination...     1.42484 seconds.
[PassRunner]   running pass: duplicate-import-elimination...       0.0266978 seconds.
[PassRunner]   running pass: simplify-globals-optimizing...        0.25057 seconds.
[PassRunner]   running pass: remove-unused-module-elements...      0.771398 seconds.
[PassRunner]   running pass: directize...                          0.115832 seconds.
[PassRunner]   running pass: generate-stack-ir...                  0.540637 seconds.
[PassRunner]   running pass: optimize-stack-ir...                  10.2448 seconds.
[PassRunner]   running pass: strip-dwarf...                        0.367992 seconds.
[PassRunner]   running pass: post-emscripten...                    0.595371 seconds.
[PassRunner]   running pass: strip-debug...                        0.0243991 seconds.
[PassRunner]   running pass: strip-producers...                    0.0179393 seconds.
[PassRunner] passes took 190.094 seconds.

5 slowest

[PassRunner]   running pass: duplicate-function-elimination...     14.2522 seconds.
[PassRunner]   running pass: optimize-added-constants-propagate... 15.1376 seconds.
[PassRunner]   running pass: dae-optimizing...                     59.6788 seconds.
[PassRunner]   running pass: inlining-optimizing...                20.4154 seconds.
[PassRunner]   running pass: optimize-stack-ir...                  10.2448 seconds.
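
For anyone who wants to repeat this ranking on their own logs, here is a minimal sketch that pulls the per-pass timings out of output shaped like the lines above and returns the slowest ones. The `PassTiming` struct and the `slowestPasses` helper are hypothetical names made up for illustration; they are not part of Binaryen, and the regex only assumes the "running pass: NAME... T seconds." shape shown in this comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// One timed pass taken from a "running pass: NAME...   T seconds." log line.
struct PassTiming {
  std::string name;
  double seconds;
};

// Hypothetical helper (not a Binaryen API): collects the per-pass timings
// found in `log` and returns at most `count` of them, slowest first.
// Lines that do not match, such as the "passes took" total, are ignored.
std::vector<PassTiming> slowestPasses(const std::string& log,
                                      std::size_t count) {
  assert(count > 0);
  static const std::regex passLine(
    R"(running pass: ([A-Za-z0-9-]+)\.\.\.\s+([0-9.]+) seconds)");

  std::vector<PassTiming> timings;
  std::istringstream in(log);
  std::string text;
  std::smatch match;
  while (std::getline(in, text)) {
    if (std::regex_search(text, match, passLine)) {
      timings.push_back({match[1].str(), std::stod(match[2].str())});
    }
  }

  // Stable sort keeps the original pass order among equal timings.
  std::stable_sort(timings.begin(), timings.end(),
                   [](const PassTiming& a, const PassTiming& b) {
                     return a.seconds > b.seconds;
                   });
  if (timings.size() > count) {
    timings.resize(count);
  }
  return timings;
}
```

Calling `slowestPasses(logText, 5)` on the full -Os log would give the same kind of top-5 list as above, with repeated passes such as vacuum reported once per run rather than summed.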

arsnyder16 avatar Apr 27 '22 15:04 arsnyder16

Interesting about dae-optimizing. I believe it will do an arbitrary number of iterations, so maybe your codebase somehow ends up running a lot of them. That is, somehow after finding some function arguments are dead, it optimizes and finds even more, and so forth.

Adding some manual logging in the pass might help see if that's the issue. If it is, maybe we could add a hard limit on the total number of iterations, or something like that.
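
To make the idea of a hard limit concrete, here is a rough sketch of a fixed-point driver with an iteration cap. This is not the actual dae-optimizing code; `runToFixedPoint`, `FixedPointResult`, and the `step` callback are hypothetical names, and the only assumption is that one round of optimization can report whether it changed anything.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Outcome of driving an optimization step toward a fixed point.
struct FixedPointResult {
  std::size_t iterations; // how many times `step` was run
  bool converged;         // true if `step` reported "no change" before the cap
};

// Hypothetical driver (not Binaryen's implementation): `step` performs one
// round of optimization and returns true if it changed anything. The loop
// stops as soon as a round changes nothing, or after `maxIterations` rounds.
FixedPointResult runToFixedPoint(const std::function<bool()>& step,
                                 std::size_t maxIterations) {
  assert(maxIterations > 0);
  for (std::size_t i = 1; i <= maxIterations; ++i) {
    if (!step()) {
      return {i, true};
    }
  }
  return {maxIterations, false};
}
```

Logging the returned `iterations` would also answer the first question directly: whether this codebase really triggers an unusually large number of rounds.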

kripken avatar Apr 27 '22 16:04 kripken

Thanks for the pointers. I pulled the repo and built it, and I'm going to dig in a little and see what I find.

arsnyder16 avatar Apr 27 '22 16:04 arsnyder16

@kripken I am close to having some improvements, specifically for dae-optimizing.

Just so I can get an accurate before and after: what toolchain is used for the prebuilt binaries shipped with emscripten? Which compiler and STL?

arsnyder16 avatar Apr 28 '22 20:04 arsnyder16

We build libcxx but I'm not sure we use it on all platforms (might use the system one).

Here is an example CI log of where we build it. That also has the id of the compiler, which looks like clang 15:

https://logs.chromium.org/logs/emscripten-releases/buildbucket/cr-buildbucket/8815612072989849729/+/u/Build_libcxx/stdout

The same compiler should be used for all the other parts of the build, but it may vary by OS; I'm not sure. Reading some more build logs might give more info:

https://ci.chromium.org/p/emscripten-releases

Here is the logic that creates the builds:

https://chromium.googlesource.com/emscripten-releases/+/refs/heads/main/src/build.py

Overall though, I'd hope any compiler speedups don't depend on the compiler or STL much. If you measure with the same ones before and after, that should be enough. If we are not sure we can do more measurements on other machines.

kripken avatar Apr 28 '22 21:04 kripken

Thanks, this is helpful. There is a difference between the two, which is why I asked: I noticed my times were off once I started building it myself.

For example, on my Ubuntu 18.04 WSL using g++ 9.4 and libstdc++:

...
[PassRunner]   running pass: dae-optimizing...                     63.0937 seconds.
...
[PassRunner] passes took 201.335 seconds.

real	3m32.960s
user	8m47.119s
sys	0m1.302s

And with clang 14 and libc++:

...
[PassRunner]   running pass: dae-optimizing...                     53.1484 seconds.
...
[PassRunner] passes took 185.137 seconds.

real	3m16.386s
user	8m22.423s
sys	0m1.652s

arsnyder16 avatar Apr 29 '22 01:04 arsnyder16

@kripken With the release of Firefox 100, wasm exceptions now seem to be supported by all major browsers by default, so I was switching my build process to use wasm exceptions. Interestingly, I found some newly slow optimization passes that only seem to appear with exception support:

[PassRunner]   running pass: rse...                                296.115 seconds.
[PassRunner]   running pass: inlining-optimizing...                752.966 seconds.

arsnyder16 avatar May 04 '22 18:05 arsnyder16

@arsnyder16 Interesting, it's not obvious to me why that happens... though obviously codegen is very different in this case, so it could be anything related to that, like maybe the extra try-catches make the CFG much more complex.

The rse case is likely the easier to debug by far (inlining-optimizing does a lot more, and actually calls rse, so it could be a single cause). Maybe something obvious shows up there when profiling? (Can you share the wasm file?)

It may be worth seeing the effects of https://github.com/WebAssembly/binaryen/pull/4618 , btw. It would be interesting if on your codebase that optimization ends up helping more than on the ones we've tested, and perhaps it actually makes rse afterwards faster...

cc @aheejin - I wonder if maybe it's worth landing https://github.com/WebAssembly/binaryen/pull/4618 but not enabling it by default? It could make investigations like these easier maybe.

kripken avatar May 05 '22 16:05 kripken

@kripken Thanks, that's a good head start, knowing that rse might be the common bottleneck. Based on initial profiling nothing was glaring, so I am going to continue to investigate.

No luck with #4618. I added exception-opts as the first optimization to run; let me know if you think I should try it elsewhere.

I will get back to you about sending you our wasm; it's not my call. Ideally I can send you both builds, with and without -fwasm-exceptions, as both are mentioned in this issue.

arsnyder16 avatar May 05 '22 17:05 arsnyder16

I have it isolated to a single function that spends about 4 minutes in RedundantSetElimination::flowValues. I'm going to try to reduce it to a simple repro.

arsnyder16 avatar May 05 '22 18:05 arsnyder16

@kripken Here is an example that replicates the problem. It is not as extreme, but the issue still shows. Basically, it is a function with a large map that is used as a lookup table.

I abstracted away the contents of the map, so it's really nonsensical, but it does replicate the issue.

rse.tar.gz

arsnyder16 avatar May 06 '22 12:05 arsnyder16

Thanks for the testcase. It looks like it gets a lot slower with any type of exceptions:

mode                    time   local.gets  local.sets  blocks
no exceptions            0.1         5997        2301     442
emscripten exceptions   27.1        14420       10687    3107
wasm exceptions         33.6        21304        4957    2577

RSE operates on locals, so the much larger numbers of local instructions are likely the cause, combined with the more complicated CFG (try-catch bodies or checks after calls, etc.). The number of iterations on basic blocks in the main loop there is a few hundred without exceptions, 500K with emscripten exceptions, and 1,396K with wasm exceptions. Possibly the algorithm could be rewritten so that a single local update does not cause an update of the entire block, though doing it the other way (finishing one local entirely before moving on to the next) has downsides too.
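
To illustrate where a count like "iterations on basic blocks" comes from, here is a toy worklist dataflow. It is deliberately not RSE's algorithm: it only tracks, per block, which locals may have been written on some path to the block's entry. It does share the property discussed above, namely that a change to any single local at a block's entry puts the whole block back on the worklist. `BasicBlock`, `FlowStats`, and `flowWrites` are hypothetical names for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// A toy basic block: which locals it writes, and where control goes next.
struct BasicBlock {
  std::vector<bool> writes;            // writes[i]: block sets local i
  std::vector<std::size_t> successors; // indices of successor blocks
};

struct FlowStats {
  // mayBeWrittenAtEntry[b][i]: local i may have been set before block b runs.
  std::vector<std::vector<bool>> mayBeWrittenAtEntry;
  std::size_t blockVisits = 0; // how many times any block was (re)processed
};

// Hypothetical illustration, not Binaryen code. Every block is processed at
// least once; a block is re-queued whenever ANY local's entry state changes.
FlowStats flowWrites(const std::vector<BasicBlock>& cfg,
                     std::size_t numLocals) {
  FlowStats stats;
  stats.mayBeWrittenAtEntry.assign(cfg.size(),
                                   std::vector<bool>(numLocals, false));
  std::deque<std::size_t> work;
  std::vector<bool> queued(cfg.size(), true);
  for (std::size_t b = 0; b < cfg.size(); ++b) {
    work.push_back(b);
  }
  while (!work.empty()) {
    std::size_t index = work.front();
    work.pop_front();
    queued[index] = false;
    ++stats.blockVisits;

    // State at block exit = state at entry, plus this block's own writes.
    std::vector<bool> out = stats.mayBeWrittenAtEntry[index];
    for (std::size_t i = 0; i < numLocals; ++i) {
      if (cfg[index].writes[i]) {
        out[i] = true;
      }
    }
    // Merge into each successor; one changed local re-queues the whole block.
    for (std::size_t succ : cfg[index].successors) {
      std::vector<bool>& in = stats.mayBeWrittenAtEntry[succ];
      bool changed = false;
      for (std::size_t i = 0; i < numLocals; ++i) {
        if (out[i] && !in[i]) {
          in[i] = true;
          changed = true;
        }
      }
      if (changed && !queued[succ]) {
        work.push_back(succ);
        queued[succ] = true;
      }
    }
  }
  return stats;
}
```

Even in this simplified form, more locals and more edges (for example the extra blocks that try-catch introduces) mean more opportunities for a single change to trigger another full pass over a block, which is the growth pattern the numbers above suggest.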

kripken avatar May 06 '22 22:05 kripken

@kripken A few more observations: if I switch from an initializer list to plain emplace calls, the overhead basically goes away. That makes me wonder why the emitted code for the initializer list is so different, and whether another optimization run before RSE could simplify it ahead of time.
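
For readers without the tarball, here is a cut-down sketch of the two styles being compared. `Data`, `Kind`, and the two `make...` functions are stand-ins invented for this sketch (the real types are not shown in this thread), and only three entries are used. One plausible explanation for the difference, stated as an assumption rather than something verified on the real test case: with a braced initializer list every key/value pair is constructed within a single expression, so if a later constructor throws, all earlier ones must be destroyed, and the compiler has to emit a cleanup path per element in one big function. With emplace, each statement's temporaries are gone by the end of that statement.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for the types used in the repro.
enum Kind { kV, kK, kH };
struct Data {
  int id;
  Kind kind;
};

// Initializer-list form: all pairs are materialized in one expression before
// the map is built from them.
std::map<std::string, Data> makeWithInitializerList() {
  return {
    {"?", Data{0, kV}},
    {"uamo", Data{1, kK}},
    {"uasa", Data{2, kH}},
  };
}

// Emplace form: one insertion per statement, so each statement's temporaries
// are destroyed before the next one starts.
std::map<std::string, Data> makeWithEmplace() {
  std::map<std::string, Data> data;
  data.emplace("?", Data{0, kV});
  data.emplace("uamo", Data{1, kK});
  data.emplace("uasa", Data{2, kH});
  return data;
}
```

Both functions produce the same map contents; only the shape of the generated code differs.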

Just using emplace
const std::map<std::string, Data>* GetData() {
  static std::map<std::string, Data>* data = nullptr;
  if (data == nullptr) {
    data = new std::map<std::string, Data>();
    data->emplace("?", Data{0, kV});
    data->emplace("uamo", Data {1, kK});
    data->emplace("uasa", Data {2, kH});
    data->emplace("ucf", Data {3, kK});
    data->emplace("ucgo", Data {4, kE});
    data->emplace("ucgt", Data {5, kE});
    data->emplace("uctm", Data {6, kE});
    data->emplace("uctp", Data {7, kE});
    data->emplace("udd", Data {8, kR});
    data->emplace("udf", Data {9, kK});
    data->emplace("ultt", Data {10, kI});
    data->emplace("unom", Data {11, kC});
    data->emplace("unov", Data {12, kC});
    data->emplace("urde", Data {13, kH});
    data->emplace("urgo", Data {14, kE});
    data->emplace("urgt", Data {15, kE});
    data->emplace("urgr", Data {16, kS});
    data->emplace("urim", Data {17, kK});
    data->emplace("urtm", Data {18, kE});
    data->emplace("urtp", Data {19, kE});
    data->emplace("utar", Data {20, kK});
    data->emplace("utcl", Data {21, kE});
    data->emplace("utga", Data {22, kH});
    data->emplace("utol", Data {23, kP});
    data->emplace("utre", Data {24, kE});
    data->emplace("uver", Data {25, kR});
    data->emplace("vase", Data {26, kR});
    data->emplace("vbde", Data {27, kD});
    data->emplace("vcap", Data {28, kH});
    data->emplace("vcgo", Data {29, kE});
    data->emplace("vcgt", Data {30, kE});
    data->emplace("vcpr", Data {31, kE});
    data->emplace("vctm", Data {32, kE});
    data->emplace("vctr", Data {33, kE});
    data->emplace("vffa", Data {34, kD});
    data->emplace("vfit", Data {35, kB});
    data->emplace("vgfa", Data {36, kD});
    data->emplace("voxc", Data {37, kG});
    data->emplace("voxp", Data {38, kS});
    data->emplace("vreg", Data {39, kB});
    data->emplace("vrgo", Data {40, kE});
    data->emplace("vrgt", Data {41, kE});
    data->emplace("vrie", Data {42, kV});
    data->emplace("vrpr", Data {43, kE});
    data->emplace("vrsr", Data {44, kD});
    data->emplace("vrtm", Data {45, kE});
    data->emplace("vrtr", Data {46, kE});
    data->emplace("vscr", Data {47, kD});
    data->emplace("vtft", Data {48, kR});
    data->emplace("vtin", Data {49, kE});
    data->emplace("vtpr", Data {50, kR});
    data->emplace("vttm", Data {51, kR});
    data->emplace("vubb", Data {52, kS});
    data->emplace("vwca", Data {53, kH});
    data->emplace("vwch", Data {54, kH});
    data->emplace("vwsi", Data {55, kH});
    data->emplace("aa", Data {56, kJ});
    data->emplace("aapa", Data {57, kH});
    data->emplace("aate", Data {58, kV});
    data->emplace("acde", Data {60, kD});
    data->emplace("acf", Data {61, kK});
    data->emplace("acha", Data {62, kG});
    data->emplace("ad", Data {63, kV});
    data->emplace("adf", Data {64, kR});
    data->emplace("adsd", Data {65, kD});
    data->emplace("aent", Data {66, kR});
    data->emplace("afau", Data {67, kU});
    data->emplace("affd", Data {68, kD});
    data->emplace("afma", Data {69, kU});
    data->emplace("afno", Data {70, kU});
    data->emplace("afor", Data {71, kQ});
    data->emplace("agfd", Data {72, kD});
    data->emplace("ahar", Data {73, kS});
    data->emplace("aluo", Data {74, kJ});
    data->emplace("aluv", Data {75, kJ});
    data->emplace("amxd", Data {76, kD});
    data->emplace("aoad", Data {77, kD});
    data->emplace("aode", Data {78, kQ});
    data->emplace("aomp", Data {79, kC});
    data->emplace("aonc", Data {80, kQ});
    data->emplace("aont", Data {81, kS});
    data->emplace("aonv", Data {82, kQ});
    data->emplace("aopy", Data {83, kQ});
    data->emplace("aorr", Data {84, kA});
    data->emplace("aoun", Data {85, kQ});
    data->emplace("aova", Data {86, kA});
    data->emplace("aoxr", Data {87, kI});
    data->emplace("arsd", Data {88, kD});
    data->emplace("aspd", Data {89, kD});
    data->emplace("atad", Data {90, kE});
    data->emplace("atin", Data {91, kE});
    data->emplace("atol", Data {100, kP});
    data->emplace("atpr", Data {101, kE});
    data->emplace("atre", Data {102, kE});
    data->emplace("ausu", Data {103, kG});
    data->emplace("yate", Data {104, kQ});
    data->emplace("ycap", Data {105, kH});
    data->emplace("yeco", Data {106, kK});
    data->emplace("yefi", Data {107, kR});
    data->emplace("yeft", Data {108, kG});
    data->emplace("yele", Data {109, kQ});
    data->emplace("yes", Data {110, kK});
    data->emplace("yesc", Data {111, kA});
    data->emplace("yiag", Data {112, kR});
    data->emplace("yiff", Data {113, kK});
    data->emplace("yisc", Data {114, kR});
    data->emplace("yivi", Data {115, kR});
    data->emplace("yotp", Data {116, kS});
    data->emplace("yplo", Data {117, kS});
    data->emplace("yrou", Data {118, kQ});
    data->emplace("ysde", Data {119, kD});
    data->emplace("yset", Data {120, kR});
    data->emplace("ytes", Data {121, kI});
    data->emplace("ytyp", Data {122, kV});
    data->emplace("ecdf", Data {123, kS});
    data->emplace("echo", Data {124, kV});
    data->emplace("eige", Data {125, kR});
    data->emplace("endl", Data {126, kV});
    data->emplace("endm", Data {127, kV});
    data->emplace("eras", Data {128, kQ});
    data->emplace("etes", Data {129, kI});
    data->emplace("evde", Data {130, kD});
    data->emplace("ewma", Data {131, kG});
    data->emplace("exec", Data {132, kV});
    data->emplace("exit", Data {133, kV});
    data->emplace("cacp", Data {134, kF});
    data->emplace("cact", Data {135, kJ});
    data->emplace("cdat", Data {136, kQ});
    data->emplace("cdes", Data {137, kD});
    data->emplace("cfac", Data {138, kD});
    data->emplace("cfcu", Data {139, kD});
    data->emplace("cfde", Data {140, kD});
    data->emplace("cfin", Data {141, kD});
    data->emplace("cfma", Data {142, kD});
    data->emplace("cish", Data {143, kH});
    data->emplace("citl", Data {144, kB});
    data->emplace("cnum", Data {145, kQ});
    data->emplace("corm", Data {146, kU});
    data->emplace("crie", Data {147, kM});
    data->emplace("ctex", Data {148, kQ});
    data->emplace("gage", Data {149, kH});
    data->emplace("gawo", Data {150, kH});
    data->emplace("gcha", Data {151, kG});
    data->emplace("gcor", Data {152, kS});
    data->emplace("genv", Data {153, kG});
    data->emplace("gfac", Data {154, kD});
    data->emplace("glm", Data  {155, kC});
    data->emplace("gsca", Data {156, kV});
    data->emplace("gsum", Data {157, kA});
    data->emplace("gzlm", Data {158, kB});
    data->emplace("hcut", Data {159, kV});
    data->emplace("help", Data {160, kV});
    data->emplace("hist", Data {161, kS});
    data->emplace("hmap", Data {162, kS});
    data->emplace("icha", Data {163, kG});
    data->emplace("idid", Data {164, kI});
    data->emplace("idov", Data {165, kI});
    data->emplace("iect", Data {166, kE});
    data->emplace("iert", Data {167, kE});
    data->emplace("imrc", Data {168, kH});
    data->emplace("indi", Data {169, kR});
    data->emplace("indp", Data {170, kS});
    data->emplace("info", Data {171, kV});
    data->emplace("inte", Data {172, kC});
    data->emplace("intp", Data {173, kS});
    data->emplace("invc", Data {174, kR});
    data->emplace("inve", Data {175, kR});
    data->emplace("item", Data {176, kJ});
    data->emplace("john", Data {177, kH});
    data->emplace("kkca", Data {178, kV});
    data->emplace("kkna", Data {179, kV});
    data->emplace("kkse", Data {180, kV});
    data->emplace("kmea", Data {181, kJ});
    data->emplace("krus", Data {182, kM});
    data->emplace("lag", Data  {183, kK});
    data->emplace("layo", Data {184, kV});
    data->emplace("let", Data  {185, kR});
    data->emplace("lnga", Data {186, kH});
    data->emplace("long", Data {187, kH});
    data->emplace("lplo", Data {188, kS});
    data->emplace("lreg", Data {189, kI});
    data->emplace("lrpr", Data {190, kB});
    data->emplace("ltab", Data {191, kI});
    data->emplace("ltes", Data {192, kI});
    data->emplace("ba", Data   {193, kK});
    data->emplace("bach", Data {194, kG});
    data->emplace("bain", Data {195, kC});
    data->emplace("bann", Data {196, kM});
    data->emplace("bano", Data {197, kC});
    data->emplace("barg", Data {198, kS});
    data->emplace("bars", Data {199, kE});
    data->emplace("batr", Data {200, kS});
    data->emplace("baxi", Data {201, kR});
    data->emplace("bc", Data   {202, kV});
    data->emplace("bca", Data  {203, kJ});
    data->emplace("bcap", Data {204, kH});
    data->emplace("bcol", Data {205, kV});
    data->emplace("bcon", Data {206, kV});
    data->emplace("bdes", Data {207, kD});
    data->emplace("bean", Data {208, kR});
    data->emplace("bedi", Data {209, kR});
    data->emplace("berg", Data {210, kQ});
    data->emplace("besh", Data {211, kR});
    data->emplace("bewm", Data {212, kG});
    data->emplace("bffc", Data {213, kD});
    data->emplace("bgag", Data {214, kH});
    data->emplace("bini", Data {215, kR});
    data->emplace("bixc", Data {216, kD});
    data->emplace("bixo", Data {217, kD});
    data->emplace("bixr", Data {218, kD});
    data->emplace("bixs", Data {219, kD});
    data->emplace("blag", Data {220, kK});
    data->emplace("bmat", Data {221, kV});
    data->emplace("bmop", Data {222, kF});
    data->emplace("bmpr", Data {223, kD});
    data->emplace("bnca", Data {224, kH});
    data->emplace("bood", Data {225, kM});
    data->emplace("bove", Data {226, kF});
    data->emplace("brch", Data {227, kG});
    data->emplace("brop", Data {228, kD});
    data->emplace("brpr", Data {229, kB});
    data->emplace("bsur", Data {230, kF});
    data->emplace("btit", Data {231, kV});
    data->emplace("btyp", Data {232, kV});
    data->emplace("bult", Data {233, kR});
    data->emplace("bvar", Data {234, kH});
    data->emplace("byme", Data {235, kV});
    data->emplace("n", Data    {236, kR});
    data->emplace("name", Data {237, kV});
    data->emplace("nbox", Data {238, kS});
    data->emplace("ncha", Data {239, kS});
    data->emplace("nest", Data {240, kC});
    data->emplace("new", Data  {241, kV});
    data->emplace("ngro", Data {242, kI});
    data->emplace("nhis", Data {243, kS});
    data->emplace("nind", Data {244, kS});
    data->emplace("nlin", Data {245, kB});
    data->emplace("nlog", Data {246, kB});
    data->emplace("nmis", Data {247, kR});
    data->emplace("nmva", Data {248, kH});
    data->emplace("nnca", Data {249, kH});
    data->emplace("nnsi", Data {250, kH});
    data->emplace("nnti", Data {251, kH});
    data->emplace("nobr", Data {252, kV});
    data->emplace("noec", Data {253, kV});
    data->emplace("noou", Data {254, kV});
    data->emplace("nopr", Data {255, kV});
    data->emplace("norm", Data {256, kA});
    data->emplace("note", Data {257, kV});
    data->emplace("noyi", Data {258, kV});
    data->emplace("npch", Data {259, kG});
    data->emplace("nplo", Data {260, kS});
    data->emplace("ntga", Data {261, kH});
    data->emplace("ntsp", Data {262, kS});
    data->emplace("nume", Data {263, kQ});
    data->emplace("oade", Data {264, kD});
    data->emplace("oapr", Data {265, kD});
    data->emplace("odbc", Data {266, kV});
    data->emplace("olog", Data {267, kB});
    data->emplace("oner", Data {268, kA});
    data->emplace("onet", Data {269, kA});
    data->emplace("onev", Data {270, kA});
    data->emplace("onew", Data {271, kC});
    data->emplace("onez", Data {272, kA});
    data->emplace("optd", Data {273, kD});
    data->emplace("oreg", Data {274, kB});
    data->emplace("outf", Data {275, kV});
    data->emplace("outl", Data {276, kA});
    data->emplace("over", Data {277, kS});
    data->emplace("ow", Data   {278, kV});
    data->emplace("lacf", Data {279, kK});
    data->emplace("lair", Data {280, kA});
    data->emplace("lare", Data {281, kH});
    data->emplace("laus", Data {282, kV});
    data->emplace("larp", Data {283, kS});
    data->emplace("lbde", Data {284, kD});
    data->emplace("lca", Data  {285, kJ});
    data->emplace("lcap", Data {286, kH});
    data->emplace("lcha", Data {287, kG});
    data->emplace("ldf", Data  {288, kR});
    data->emplace("ldia", Data {289, kG});
    data->emplace("lgoo", Data {290, kA});
    data->emplace("lgro", Data {291, kI});
    data->emplace("liec", Data {292, kS});
    data->emplace("llot", Data {293, kS});
    data->emplace("lls", Data  {294, kB});
    data->emplace("lmpr", Data {295, kB});
    data->emplace("lltx", Data {296, kS});
    data->emplace("lone", Data {297, kA});
    data->emplace("lowe", Data {298, kO});
    data->emplace("lplo", Data {299, kS});
    data->emplace("lpri", Data {300, kG});
    data->emplace("lred", Data {301, kF});
    data->emplace("lrep", Data {302, kD});
    data->emplace("lrin", Data {303, kV});
    data->emplace("lrob", Data {304, kI});
    data->emplace("lroc", Data {305, kP});
    data->emplace("lrod", Data {306, kP});
    data->emplace("lrof", Data {307, kV});
    data->emplace("ltwo", Data {308, kA});
    data->emplace("lysc", Data {309, kV});
    data->emplace("yuit", Data {310, kV});
    data->emplace("ar", Data    {311, kV});
    data->emplace("arand", Data {312, kR});
    data->emplace("arang", Data {313, kR});
    data->emplace("arank", Data {313, kQ});
    data->emplace("arcfo", Data {314, kE});
    data->emplace("archa", Data {315, kG});
    data->emplace("arcou", Data {316, kR});
    data->emplace("arcpr", Data {317, kE});
    data->emplace("aresu", Data {318, kV});
    data->emplace("ardid", Data {319, kI});
    data->emplace("ardov", Data {320, kI});
    data->emplace("aread", Data {321, kV});
    data->emplace("aregr", Data {322, kB});
    data->emplace("areml", Data {323, kB});
    data->emplace("arest", Data {324, kV});
    data->emplace("aretr", Data {325, kV});
    data->emplace("arfor", Data {326, kU});
    data->emplace("armax", Data {327, kR});
    data->emplace("armco", Data {328, kF});
    data->emplace("armea", Data {329, kR});
    data->emplace("armed", Data {330, kR});
    data->emplace("armer", Data {331, kQ});
    data->emplace("armin", Data {332, kR});
    data->emplace("arn", Data   {333, kR});
    data->emplace("arnga", Data {334, kH});
    data->emplace("arnmi", Data {335, kR});
    data->emplace("arnmn", Data {336, kR});
    data->emplace("arnpr", Data {337, kR});
    data->emplace("arntm", Data {338, kR});
    data->emplace("arobu", Data {339, kD});
    data->emplace("arowt", Data {340, kQ});
    data->emplace("arran", Data {341, kR});
    data->emplace("arrfo", Data {342, kE});
    data->emplace("arrpr", Data {343, kE});
    data->emplace("arsco", Data {344, kS});
    data->emplace("arscr", Data {345, kV});
    data->emplace("arsre", Data {346, kD});
    data->emplace("arssq", Data {347, kR});
    data->emplace("arstd", Data {348, kR});
    data->emplace("arsum", Data {349, kR});
    data->emplace("artin", Data {350, kE});
    data->emplace("artpr", Data {351, kE});
    data->emplace("artre", Data {352, kE});
    data->emplace("arunc", Data {353, kH});
    data->emplace("aruns", Data {354, kR});
    data->emplace("lave", Data {355, kV});
    data->emplace("lcde", Data {356, kD});
    data->emplace("lcha", Data {357, kG});
    data->emplace("lcre", Data {358, kD});
    data->emplace("les", Data  {359, kK});
    data->emplace("let", Data  {360, kR});
    data->emplace("lhel", Data {370, kB});
    data->emplace("limp", Data {371, kD});
    data->emplace("lint", Data {372, kM});
    data->emplace("lixp", Data {373, kH});
    data->emplace("llde", Data {374, kD});
    data->emplace("lort", Data {375, kQ});
    data->emplace("lpco", Data {376, kD});
    data->emplace("lpde", Data {377, kD});
    data->emplace("lpfa", Data {378, kD});
    data->emplace("lpli", Data {379, kQ});
    data->emplace("lpsi", Data {380, kD});
    data->emplace("lsci", Data {381, kO});
    data->emplace("lsq", Data  {382, kR});
    data->emplace("lsti", Data {383, kO});
    data->emplace("lswo", Data {384, kB});
    data->emplace("ltac", Data {385, kQ});
    data->emplace("ltat", Data {386, kA});
    data->emplace("ltd", Data  {387, kR});
    data->emplace("ltde", Data {389, kR});
    data->emplace("ltem", Data {390, kS});
    data->emplace("ltes", Data {391, kM});
    data->emplace("ltop", Data {392, kV});
    data->emplace("lubs", Data {393, kQ});
    data->emplace("lubt", Data {394, kV});
    data->emplace("lum", Data  {395, kR});
    data->emplace("lurf", Data {396, kS});
    data->emplace("lymp", Data {397, kH});
    data->emplace("tabl", Data {398, kL});
    data->emplace("tall", Data {399, kL});
    data->emplace("tbox", Data {400, kK});
    data->emplace("tcha", Data {401, kG});
    data->emplace("tchi", Data {402, kL});
    data->emplace("tdpr", Data {403, kI});
    data->emplace("text", Data {404, kQ});
    data->emplace("toga", Data {405, kH});
    data->emplace("toli", Data {406, kH});
    data->emplace("tost", Data {407, kN});
    data->emplace("trac", Data {408, kD});
    data->emplace("tran", Data {409, kQ});
    data->emplace("tren", Data {410, kK});
    data->emplace("tset", Data {411, kR});
    data->emplace("tsgv", Data {412, kG});
    data->emplace("tspl", Data {413, kS});
    data->emplace("tsqu", Data {414, kG});
    data->emplace("tswi", Data {415, kK});
    data->emplace("twar", Data {416, kI});
    data->emplace("twor", Data {417, kA});
    data->emplace("twos", Data {418, kA});
    data->emplace("twot", Data {419, kA});
    data->emplace("twov", Data {420, kA});
    data->emplace("twow", Data {421, kC});
    data->emplace("ucha", Data {422, kG});
    data->emplace("udia", Data {423, kG});
    data->emplace("unst", Data {424, kQ});
    data->emplace("upri", Data {425, kG});
    data->emplace("vart", Data {426, kC});
    data->emplace("vasa", Data {427, kH});
    data->emplace("vdes", Data {428, kV});
    data->emplace("vfac", Data {429, kD});
    data->emplace("vmas", Data {430, kG});
    data->emplace("vord", Data {431, kV});
    data->emplace("vpre", Data {432, kD});
    data->emplace("wals", Data {433, kM});
    data->emplace("wdes", Data {434, kV});
    data->emplace("wdif", Data {435, kM});
    data->emplace("wint", Data {436, kM});
    data->emplace("wope", Data {437, kV});
    data->emplace("work", Data {438, kV});
    data->emplace("wpre", Data {439, kI});
    data->emplace("writ", Data {440, kV});
    data->emplace("wsav", Data {441, kV});
    data->emplace("wslo", Data {442, kM});
    data->emplace("wsta", Data {443, kQ});
    data->emplace("wtes", Data {444, kM});
    data->emplace("arach", Data {445, kT});
    data->emplace("aratg", Data {446, kT});
    data->emplace("arbar", Data {447, kG});
    data->emplace("arbca", Data {448, kT});
    data->emplace("arbim", Data {449, kT});
    data->emplace("arbox", Data {450, kT});
    data->emplace("arbpc", Data {451, kT});
    data->emplace("arbuc", Data {452, kT});
    data->emplace("arbxr", Data {453, kT});
    data->emplace("arbxs", Data {454, kT});
    data->emplace("arcap", Data {455, kT});
    data->emplace("archa", Data {456, kT});
    data->emplace("ardac", Data {457, kV});
    data->emplace("ardad", Data {458, kV});
    data->emplace("ardde", Data {459, kV});
    data->emplace("ardex", Data {460, kV});
    data->emplace("ardge", Data {461, kV});
    data->emplace("ardpo", Data {462, kV});
    data->emplace("ardre", Data {463, kV});
    data->emplace("arffa", Data {464, kT});
    data->emplace("arffd", Data {465, kT});
    data->emplace("argag", Data {466, kT});
    data->emplace("argic", Data {467, kT});
    data->emplace("argim", Data {468, kT});
    data->emplace("argpc", Data {469, kT});
    data->emplace("argsu", Data {470, kT});
    data->emplace("arguc", Data {471, kT});
    data->emplace("argxr", Data {472, kT});
    data->emplace("arhis", Data {473, kT});
    data->emplace("arimr", Data {474, kT});
    data->emplace("arind", Data {475, kT});
    data->emplace("arint", Data {476, kT});
    data->emplace("armai", Data {477, kT});
    data->emplace("armat", Data {478, kT});
    data->emplace("armfd", Data {479, kT});
    data->emplace("armre", Data {480, kT});
    data->emplace("aroca", Data {481, kT});
    data->emplace("arone", Data {482, kT});
    data->emplace("arpai", Data {483, kT});
    data->emplace("arpar", Data {484, kT});
    data->emplace("arpas", Data {485, kV});
    data->emplace("arpbd", Data {486, kT});
    data->emplace("arpbf", Data {487, kT});
    data->emplace("arpca", Data {488, kT});
    data->emplace("arpch", Data {489, kT});
    data->emplace("arpcs", Data {490, kT});
    data->emplace("arpde", Data {491, kT});
    data->emplace("arpie", Data {492, kT});
    data->emplace("arplo", Data {493, kT});
    data->emplace("arpon", Data {494, kT});
    data->emplace("arppo", Data {495, kV});
    data->emplace("arptw", Data {496, kT});
    data->emplace("arrch", Data {497, kG});
    data->emplace("arreg", Data {498, kT});
    data->emplace("arrsr", Data {499, kT});
    data->emplace("arsch", Data {500, kG});
    data->emplace("artab", Data {501, kL});
    data->emplace("artch", Data {502, kT});
    data->emplace("arton", Data {503, kT});
    data->emplace("artsp", Data {504, kT});
    data->emplace("arttw", Data {505, kT});
    data->emplace("aruch", Data {506, kT});
    data->emplace("arvar", Data {507, kT});
    data->emplace("arvon", Data {508, kT});
    data->emplace("arvtw", Data {509, kT});
    data->emplace("arwor", Data {510, kV});
    data->emplace("arxrc", Data {511, kT});
    data->emplace("arxsc", Data {512, kT});
    data->emplace("yiel", Data {513, kV});
    data->emplace("1aut", Data {514, kW});
    data->emplace("1cfc", Data {515, kW});
    data->emplace("1dlg", Data {516, kW});
    data->emplace("1err", Data {517, kW});
    data->emplace("1fai", Data {518, kW});
    data->emplace("1for", Data {519, kW});
    data->emplace("1ged", Data {520, kW});
    data->emplace("1gzl", Data {521, kB});
    data->emplace("1hie", Data {522, kW});
    data->emplace("1ins", Data {523, kW});
    data->emplace("1int", Data {524, kW});
    data->emplace("1let", Data {525, kW});
    data->emplace("1lin", Data {526, kW});
    data->emplace("1lli", Data {527, kW});
    data->emplace("1mgf", Data {528, kW});
    data->emplace("1mrc", Data {529, kW});
    data->emplace("1obs", Data {530, kW});
    data->emplace("1one", Data {531, kW});
    data->emplace("1pmi", Data {532, kW});
    data->emplace("1reg", Data {533, kB});
    data->emplace("1rex", Data {534, kW});
    data->emplace("1sav", Data {535, kW});
    data->emplace("1src", Data {536, kW});
    data->emplace("1tes", Data {537, kW});
    data->emplace("1tim", Data {538, kW});
    data->emplace("1trk", Data {539, kW});
    data->emplace("1tur", Data {540, kW});
    data->emplace("1val", Data {541, kW});
    data->emplace("1zin", Data {542, kW});
    data->emplace("1zqa", Data {543, kW});
  }
  return data;
}

arsnyder16 avatar May 07 '22 03:05 arsnyder16

@kripken I am wondering if I can attack the opt performance issue differently. Curious about your thoughts. Here is my current situation:

  • linux/mac/win: we create a shared object that is then used by about 5 test/tool exes. We also consume the library in various desktop/server environments for production-level products.

    • In this environment the optimization cost of our library is paid one time since the object is shared.
  • wasm: we create a single static library that is then linked into 5 test/tool js/wasm builds. We also create a single wasm file from that static library that can be consumed by various browser-based JavaScript apps.

    • In this environment the optimization cost is paid 6 times each time we produce a .wasm
    • We also supply pthread and non-pthread variations, so we double that 6 to 12.

In the end, the current wasm route really exposes any performance issues in the optimizer, since we are optimizing the same code multiple times.

Any suggestions here? I haven't gone down the Main/Side module route yet; is that an option here?

arsnyder16 avatar May 07 '22 19:05 arsnyder16

@arsnyder16 The emplace issue is definitely worth looking into. At a glance, clang -Os emits very different code there - 2x more locals, but less than half the amount of local.gets. Understanding why that is might help if it is something we can optimize better before rse, as you said.
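
As a possible source-level experiment for the emplace case, here is a minimal sketch that moves the entries into a static table and calls emplace from a single loop, so the function body no longer contains one call site per entry. The `Kind` enum, the `Data` layout, the `Entry` struct, `kEntries`, and `buildData` are all assumptions made for illustration (the real types are not shown in full in this thread), only four of the entries are reproduced, and whether this actually reduces wasm-opt time has not been measured here. It assumes C++17.

```cpp
#include <array>
#include <cassert>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-ins for the real types in the generated code.
enum class Kind { kA, kG, kK, kW };

struct Data {
  int id;
  Kind kind;
};

struct Entry {
  std::string_view key;
  Data value;
};

// The table lives in static data instead of being spelled out as code.
// Only a few entries from the listing above are shown.
constexpr std::array<Entry, 4> kEntries{{
    {"tren", {410, Kind::kK}},
    {"twor", {417, Kind::kA}},
    {"ucha", {422, Kind::kG}},
    {"1zqa", {543, Kind::kW}},
}};

using DataMap = std::unordered_map<std::string_view, Data>;

// A single emplace call site inside a loop, rather than one per entry.
inline DataMap buildData() {
  DataMap data;
  data.reserve(kEntries.size());
  for (const Entry& e : kEntries) {
    bool inserted = data.emplace(e.key, e.value).second;
    assert(inserted && "duplicate key in kEntries");
    (void)inserted;  // avoid an unused-variable warning when NDEBUG is set
  }
  return data;
}
```

Calling `buildData()` once at startup would give the same lookups as the emplace sequence, for example `buildData().at("tren").id`.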

@arsnyder16 One option might be main/side module, but that does add its own overhead for relocations. But you might try it.

Another option might be to build a single big tool instead of 5 tools, and that big tool could have five entry points. For release builds though that wouldn't be good obviously.
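
A rough sketch of the single-tool idea, assuming each tool's former `main()` can be renamed to an ordinary function: one binary selects the entry point by name, so the shared library code is linked and optimized once. The names `toolA`, `toolB`, `kTools`, and `dispatch` are hypothetical, and the stub return values are placeholders.

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical entry points; in a real build each would be the former
// main() of one test/tool executable. The return values are placeholders.
int toolA(int, char**) { return 0; }
int toolB(int, char**) { return 3; }

struct Tool {
  const char* name;
  int (*entry)(int, char**);
};

constexpr Tool kTools[] = {
    {"tool-a", toolA},
    {"tool-b", toolB},
};

// Picks a tool by argv[1] and forwards the remaining arguments to it.
// Returns the tool's exit code, or 2 if no tool was selected.
int dispatch(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <tool> [args...]\n", argv[0]);
    return 2;
  }
  for (const Tool& t : kTools) {
    if (std::strcmp(argv[1], t.name) == 0) {
      return t.entry(argc - 1, argv + 1);
    }
  }
  std::fprintf(stderr, "unknown tool: %s\n", argv[1]);
  return 2;
}
```

The real `main()` would then just `return dispatch(argc, argv);`, and each former tool would be invoked by passing its name as the first argument.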

No real solution for the pthread vs non-pthread issue - those just need separate builds. In theory we could have a "pthread-lowering" pass that removes it (fast, in linear time) to get a non-pthread build from a pthread one, but for release builds that wouldn't be good.

(If these aren't release builds, btw, you can also just build with lower optimization levels - -O1 should be almost linear time.)

kripken avatar May 11 '22 16:05 kripken