roaring icon indicating copy to clipboard operation
roaring copied to clipboard

Use vector instructions for union2by2

Open lemire opened this issue 5 years ago • 5 comments

We have ARM assembly for union2by2 as per https://github.com/RoaringBitmap/roaring/pull/287

It should be said that there are vectorized algorithm (hint: ARM NEON) for union routines. It appears in the C code (for SSE only but it is portable).

https://github.com/RoaringBitmap/CRoaring/blob/master/src/array_util.c#L1569-L1647

It is described in section 4.3 of the following paper:

It is actually not difficult to implement. For someone that is either fluent in ARM NEON or wants to learn... it is a great project.

lemire avatar Oct 29 '20 23:10 lemire

https://github.com/DLTcollab/sse2neon seems like a valuable reference

jacksonrnewhouse avatar Oct 30 '20 19:10 jacksonrnewhouse

If someone is interested, I am fluent in the various SIMD dialects. Finding the intrinsics and the instructions is not so difficult. There are few of them to find anyhow.

The hard part is to put it all together.

lemire avatar Oct 30 '20 20:10 lemire

Unfortunately golang hasn't ported over UMIN and UMAX. I opened an issue on it, but will probably have to figure out the bitwise representation for now.

jacksonrnewhouse avatar Nov 01 '20 22:11 jacksonrnewhouse

@lemire, the intersection algorithms use _mm_cmpestrm to compare 2 registers of 8 shorts each, and there isn't an equivalent instruction in arm64, I believe. Have any suggestions? My thoughts are to do something like sse_merge, where you run CMEQ between the registers, then rotate 8 times using EXT, and pop_count the result, shifting right by 4 bits (each equal case sets 16 bits). Any suggestions on other approaches?

jacksonrnewhouse avatar Nov 09 '20 01:11 jacksonrnewhouse

Though cmpestrm looks really good, it is an expensive instruction with many cycles of latency so the approach you suggest is more competitive than it appears.

lemire avatar Nov 09 '20 01:11 lemire