Try branch-free version of Eytzinger binary search
The issue appeared to be much more complex then I anticipated. I've tried to use the approach described in the paper. But there are many problem on the road:
- This approach relies on a compiler to generate
cmovinstruction so the code is branch-free. But we don't have direct control of the instruction that are emitted from code generator. And at least GHC 8.4.3 do not generate them. - It's not possible to help compiler by having
let !unlifted = .... in go unliftedas unlifted values can't be in let binding. - The solution relies on the
__builtin_ffsintrinsic to decode result, but implementing__builtin_ffseither in terms of existing onesWordsize - (ctz# (complement value))or usingunsafe ffileads to a larger overhead. - The solution relies on efficient preloading of the array so it will be in the cache. But we keep 2 arrays and fetch values from them, so quite likely we efficiently corrupt our caches, breaking performance.
- There are no efficient functions for comparing 128bit values (though we can have such in C).
- I've tried to use prefetch alone, but either I did that wrong, or performance hit for introducing
ST(so we have state token) outterweight performance benefits.
What can be done:
- Introduce new
primopforffsinto GHC. - Instead of keeping 2 unboxed vectors - keep only one and calculate offset on our own.
- Write lookup procedure in
cmmdirectly.
I'm not sure if any of that (except for 1.) worth the trouble. For anyone interested in I have code in https://github.com/qnikst/typerep-map/tree/qn/nobranch but it leads to 1.5x slowdown, maybe anyone would be more lucky than me..
@qnikst Thanks for this extremely valuable comment and your investigation!
I've pushed a C version that is called via unsafe ccall, but surprisingly enough it works slower than native Haskell version (but faster than "no-branch" Haskell version). Seems that using a vector of 128bit words may be the only way to really improve speed (if needed)
Update now I have a version that works the same or a bit better then Haskell version (benchmarks are not precise enough to validate that), but so we can try to add support for that and have a flag to disable C code (may be relevant for ghcjs).