typerep-map icon indicating copy to clipboard operation
typerep-map copied to clipboard

Try branch-free version of Eytzinger binary search

Open chshersh opened this issue 7 years ago • 4 comments

chshersh avatar Jul 11 '18 08:07 chshersh

The issue appeared to be much more complex then I anticipated. I've tried to use the approach described in the paper. But there are many problem on the road:

  1. This approach relies on a compiler to generate cmov instruction so the code is branch-free. But we don't have direct control of the instruction that are emitted from code generator. And at least GHC 8.4.3 do not generate them.
  2. It's not possible to help compiler by having let !unlifted = .... in go unlifted as unlifted values can't be in let binding.
  3. The solution relies on the __builtin_ffs intrinsic to decode result, but implementing __builtin_ffs either in terms of existing ones Wordsize - (ctz# (complement value)) or using unsafe ffi leads to a larger overhead.
  4. The solution relies on efficient preloading of the array so it will be in the cache. But we keep 2 arrays and fetch values from them, so quite likely we efficiently corrupt our caches, breaking performance.
  5. There are no efficient functions for comparing 128bit values (though we can have such in C).
  6. I've tried to use prefetch alone, but either I did that wrong, or performance hit for introducing ST (so we have state token) outterweight performance benefits.

What can be done:

  1. Introduce new primop for ffs into GHC.
  2. Instead of keeping 2 unboxed vectors - keep only one and calculate offset on our own.
  3. Write lookup procedure in cmm directly.

I'm not sure if any of that (except for 1.) worth the trouble. For anyone interested in I have code in https://github.com/qnikst/typerep-map/tree/qn/nobranch but it leads to 1.5x slowdown, maybe anyone would be more lucky than me..

qnikst avatar Oct 14 '18 08:10 qnikst

@qnikst Thanks for this extremely valuable comment and your investigation!

chshersh avatar Oct 14 '18 09:10 chshersh

I've pushed a C version that is called via unsafe ccall, but surprisingly enough it works slower than native Haskell version (but faster than "no-branch" Haskell version). Seems that using a vector of 128bit words may be the only way to really improve speed (if needed)

qnikst avatar Oct 14 '18 10:10 qnikst

Update now I have a version that works the same or a bit better then Haskell version (benchmarks are not precise enough to validate that), but so we can try to add support for that and have a flag to disable C code (may be relevant for ghcjs).

qnikst avatar Oct 14 '18 12:10 qnikst