codegen/llvm: improve loadTruncate() to use a simple load when possible
forcing llvm to truncate the padding bits on load prevents some optimizations.
fixes 370662c565ba541157e3c69d12625c28787d642e closes https://github.com/ziglang/zig/issues/17768
The idea of this change is to assume that if there is a local variable in it's own alloca, then the padding bits are clean. Now,
- I don't know if it's true (I expect it is possible to write garbage to the bits with some intToPtr / ptrCast. but maybe it can happen when coping from an other variable? )
- I don't know if it the proper way to implement this...
however, it does get the job done:
Benchmark 1 (81 runs): ./benchmark_baseline 1 1000/10000 281483566841860
measurement mean ± σ min … max outliers delta
wall_time 249ms ± 13.0ms 225ms … 298ms 4 ( 5%) 0%
peak_rss 880KB ± 7.17KB 819KB … 881KB 2 ( 2%) 0%
cpu_cycles 836M ± 6.80M 828M … 859M 2 ( 2%) 0%
instructions 2.06G ± 67.9 2.06G … 2.06G 10 (12%) 0%
cache_references 759K ± 142K 521K … 1.21M 1 ( 1%) 0%
cache_misses 17.2K ± 9.76K 1.56K … 53.8K 1 ( 1%) 0%
branch_misses 5.77M ± 16.6K 5.73M … 5.80M 0 ( 0%) 0%
Benchmark 2 (70 runs): ./benchmark_master 1 1000/10000 281483566841860
measurement mean ± σ min … max outliers delta
wall_time 286ms ± 7.91ms 273ms … 306ms 0 ( 0%) 💩+ 15.0% ± 1.4%
peak_rss 909KB ± 0 909KB … 909KB 0 ( 0%) 💩+ 3.4% ± 0.2%
cpu_cycles 1.00G ± 8.34M 987M … 1.02G 0 ( 0%) 💩+ 19.9% ± 0.3%
instructions 2.39G ± 69.2 2.39G … 2.39G 5 ( 7%) 💩+ 16.0% ± 0.0%
cache_references 943K ± 697K 539K … 6.51M 3 ( 4%) 💩+ 24.3% ± 20.5%
cache_misses 28.9K ± 11.8K 6.86K … 65.4K 1 ( 1%) 💩+ 68.4% ± 20.0%
branch_misses 5.89M ± 13.8K 5.87M … 5.92M 0 ( 0%) 💩+ 2.1% ± 0.1%
Benchmark 3 (79 runs): ./benchmark_patched 1 1000/10000 281483566841860
measurement mean ± σ min … max outliers delta
wall_time 253ms ± 8.16ms 237ms … 285ms 6 ( 8%) + 1.7% ± 1.4%
peak_rss 909KB ± 0 909KB … 909KB 0 ( 0%) 💩+ 3.4% ± 0.2%
cpu_cycles 888M ± 8.69M 874M … 918M 1 ( 1%) 💩+ 6.2% ± 0.3%
instructions 2.04G ± 59.2 2.04G … 2.04G 14 (18%) ⚡- 1.1% ± 0.0%
cache_references 978K ± 1.56M 469K … 14.5M 10 (13%) + 29.0% ± 44.9%
cache_misses 25.8K ± 16.7K 1.10K … 70.9K 6 ( 8%) 💩+ 50.0% ± 24.6%
branch_misses 5.80M ± 16.7K 5.76M … 5.84M 13 (16%) + 0.5% ± 0.1%
I don't know if it the proper way to implement this...
given this, combined with the fact that it's failing CI checks, needs a rebase, I'm going to close it. I don't know if it's the proper way to implement it either, unless I dive in and implement the whole thing myself, in which case I might as well close this and start over. sorry for your wasted efforts. feel free to reopen to start the discussion anew.
yeah I'm guessing it should be done by some generic "provenance" framework, maybe in relation to alias analysis / result location semantics..
for the record, to leave some breadcrumb for future implementer:
- I've kept the commit up to date and active my local branch and the compiler works fine with it, never stumbled on a miscompilation since it's last year.
- I've updated the branch xxxbxxx:load-truncate-optim with commit rebased on the current master branch (but I don't know if it's possible to update the closed pullrequest)