roaring Use vector instructions for union2by2

We have ARM assembly for union2by2 as per https://github.com/RoaringBitmap/roaring/pull/287

It should be said that there are vectorized algorithm (hint: ARM NEON) for union routines. It appears in the C code (for SSE only but it is portable).

https://github.com/RoaringBitmap/CRoaring/blob/master/src/array_util.c#L1569-L1647

It is described in section 4.3 of the following paper:

Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018

It is actually not difficult to implement. For someone that is either fluent in ARM NEON or wants to learn... it is a great project.

Oct 29 '20 23:10 lemire

https://github.com/DLTcollab/sse2neon seems like a valuable reference

Oct 30 '20 19:10 jacksonrnewhouse

If someone is interested, I am fluent in the various SIMD dialects. Finding the intrinsics and the instructions is not so difficult. There are few of them to find anyhow.

The hard part is to put it all together.

Oct 30 '20 20:10 lemire

Unfortunately golang hasn't ported over UMIN and UMAX. I opened an issue on it, but will probably have to figure out the bitwise representation for now.

Nov 01 '20 22:11 jacksonrnewhouse

@lemire, the intersection algorithms use _mm_cmpestrm to compare 2 registers of 8 shorts each, and there isn't an equivalent instruction in arm64, I believe. Have any suggestions? My thoughts are to do something like sse_merge, where you run CMEQ between the registers, then rotate 8 times using EXT, and pop_count the result, shifting right by 4 bits (each equal case sets 16 bits). Any suggestions on other approaches?

Nov 09 '20 01:11 jacksonrnewhouse

Though cmpestrm looks really good, it is an expensive instruction with many cycles of latency so the approach you suggest is more competitive than it appears.

Nov 09 '20 01:11 lemire