polyval: detect VPCLMULQDQ at runtime
As of #44, polyval will compile to VPCLMULQDQ instructions on new enough CPU architectures.
We might be able to use a trick similar to https://github.com/RustCrypto/password-hashes/pull/440 where we detect the relevant CPU features and call a special function annotated with target_feature to ensure it's always used where available.
This is reply to this comment.
POLYVAL/GHASH can be broken down into a parallelizable portion and a sequential portion... there's an accumulation of the output that is inherently sequential, but multiplication of the inputs can be performed in parallel.
I am not sure I understand. In our implementation we XOR input block x with inner state y, multiply the XOR result with h, and store the multiplication result in y. I don't see where we can process 4 input blocks at once, which can be done with _mm512_clmulepi64_epi128
Maybe you had Poly1305 in mind?
We already implement POLYVAL in parallel using ILP. It could use VPCLMULQDQ instead (automatically, when available, as opposed to requiring special RUSTFLAGS)
We process one block at a time. ILP is used only for the 3 _mm_clmulepi64_si128 calls, only 2 of which use the same immediate argument. Here is generated assembly for our current implementation: https://rust.godbolt.org/z/zs1acTozM In my understanding, at most we can explicitly merge 2 CLMUL calls with 0x00 immediate into one _mm256_clmulepi64_epi128 call.
The optimization I wanted to explore in this particular issue is to find a way to enable VPCLMULQDQ optimizations without the user having to pass -C target-cpu=skylake as RUSTFLAGS, i.e. by enabling the required features via target_feature(enable = "...")