universal-hashes polyval: detect VPCLMULQDQ at runtime

As of #44, polyval will compile to VPCLMULQDQ instructions on new enough CPU architectures.

We might be able to use a trick similar to https://github.com/RustCrypto/password-hashes/pull/440 where we detect the relevant CPU features and call a special function annotated with target_feature to ensure it's always used where available.
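A minimal sketch of that detect-then-dispatch pattern, under some assumptions: it uses the older `pclmulqdq` feature as a stand-in, since the `vpclmulqdq` feature name may need a recent toolchain, and the names `clmul`, `clmul_hw` and `clmul_soft` are hypothetical and not taken from the polyval crate.

```rust
/// Portable fallback: carry-less multiplication of two 64-bit values.
fn clmul_soft(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Hardware path. The attribute lets the compiler emit PCLMULQDQ inside this
/// function even when the crate is built without special RUSTFLAGS.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq")]
unsafe fn clmul_hw(a: u64, b: u64) -> u128 {
    use core::arch::x86_64::*;
    let x = _mm_set_epi64x(0, a as i64);
    let y = _mm_set_epi64x(0, b as i64);
    // Immediate 0x00 selects the low 64-bit half of both operands.
    let r = _mm_clmulepi64_si128::<0x00>(x, y);
    core::mem::transmute::<__m128i, u128>(r)
}

/// Dispatcher: checks the CPU at runtime and picks the fastest available path.
pub fn clmul(a: u64, b: u64) -> u128 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("pclmulqdq") {
            // SAFETY: the required CPU feature was detected just above.
            return unsafe { clmul_hw(a, b) };
        }
    }
    clmul_soft(a, b)
}
```

The same shape would presumably carry over to a function annotated with `target_feature(enable = "vpclmulqdq")`, with the runtime check done once and cached rather than per call.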

Jul 26 '23 15:07 tarcieri

This is a reply to this comment.

POLYVAL/GHASH can be broken down into a parallelizable portion and a sequential portion... there's an accumulation of the output that is inherently sequential, but multiplication of the inputs can be performed in parallel.

I am not sure I understand. In our implementation we XOR the input block x with the inner state y, multiply the result by h, and store the product back in y. I don't see where we could process 4 input blocks at once, which is what _mm512_clmulepi64_epi128 would allow.
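For reference, one way to read the quoted claim is to unroll the update described above. This is a sketch using assumed notation: x_i for the input blocks, y_i for the state after block i, and powers of h taken with respect to the POLYVAL field multiplication.

```latex
y_i = (y_{i-1} \oplus x_i) \cdot h
\quad\Longrightarrow\quad
y_4 = (y_0 \oplus x_1) \cdot h^4 \;\oplus\; x_2 \cdot h^3 \;\oplus\; x_3 \cdot h^2 \;\oplus\; x_4 \cdot h
```

With h^2, h^3 and h^4 precomputed, the four products on the right are independent of each other, and only the final XOR into the state is sequential. Whether that regrouping pays off in this crate is a separate question.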

Maybe you had Poly1305 in mind?

Jul 27 '23 11:07 newpavlov

We already implement POLYVAL in parallel using instruction-level parallelism (ILP). It could use VPCLMULQDQ instead (automatically, when available, as opposed to requiring special RUSTFLAGS).

Jul 27 '23 13:07 tarcieri

We process one block at a time. ILP is used only for the 3 _mm_clmulepi64_si128 calls, only 2 of which use the same immediate argument. Here is the generated assembly for our current implementation: https://rust.godbolt.org/z/zs1acTozM. As I understand it, at most we can explicitly merge the 2 CLMUL calls with the 0x00 immediate into one _mm256_clmulepi64_epi128 call.
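To illustrate where three 64-bit carry-less multiplications per block can come from, here is a portable Karatsuba sketch. It is an assumption about the general technique, not the crate's actual code: all function names are hypothetical, the software `clmul64` stands in for the CLMUL instruction, and the reduction step is left out.

```rust
/// Software stand-in for one 64x64 carry-less multiply.
fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// 128x128 carry-less multiply with three 64x64 products (Karatsuba).
/// Returns the 256-bit product as (low 128 bits, high 128 bits).
fn mul128_karatsuba(a: u128, b: u128) -> (u128, u128) {
    let (a0, a1) = (a as u64, (a >> 64) as u64);
    let (b0, b1) = (b as u64, (b >> 64) as u64);
    let lo = clmul64(a0, b0); // low halves of both operands
    let hi = clmul64(a1, b1); // high halves of both operands
    // Low halves of the XOR-folded operands; XOR-ing out lo and hi
    // leaves the cross terms a0*b1 ^ a1*b0.
    let mid = clmul64(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;
    (lo ^ (mid << 64), hi ^ (mid >> 64))
}

/// Reference version with four 64x64 products, used only for comparison.
fn mul128_schoolbook(a: u128, b: u128) -> (u128, u128) {
    let (a0, a1) = (a as u64, (a >> 64) as u64);
    let (b0, b1) = (b as u64, (b >> 64) as u64);
    let mid = clmul64(a0, b1) ^ clmul64(a1, b0);
    (clmul64(a0, b0) ^ (mid << 64), clmul64(a1, b1) ^ (mid >> 64))
}
```

In this sketch the `lo` and `mid` products both take the low halves of their operands, which would match two calls sharing one immediate, while `hi` takes the high halves. The three products do not depend on each other, so the CPU can overlap them.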

Jul 27 '23 17:07 newpavlov

The optimization I wanted to explore in this particular issue is to find a way to enable VPCLMULQDQ optimizations without the user having to pass -C target-cpu=skylake as RUSTFLAGS, i.e. by enabling the required features via target_feature(enable = "...").

Jul 27 '23 21:07 tarcieri