groestl: add AVX-512/GFNI backend
#718
I took a conservative approach here and kept the same in-memory state representation as the original (now soft) backend. This costs two extra vpermb instructions (_mm512_permutexvar_epi8) per call to compress and p. (Note: this is not a per-block overhead, since compress now works on a slice of blocks, per @newpavlov's recommendation.)
If it's acceptable, I can change the code to keep the state in memory in the same representation it uses in registers. That should be risk-free: it would be absurd for CPU features to change during execution, so a state written by one backend would never be read by the other.
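To make the layout trade-off above concrete, here is a minimal scalar sketch of what a vpermb-style byte permutation does and why keeping the soft backend's memory layout costs two of them per call. It assumes one permutation converts the state on load and a second converts it back on store. The function names and the 8x8 transpose used as the layout change are hypothetical, chosen only for illustration; they are not the actual index tables or code from this PR.

```rust
/// Scalar model of `vpermb` (`_mm512_permutexvar_epi8`) on a 64-byte vector:
/// each output byte i is the source byte selected by the low 6 bits of idx[i].
fn permute_bytes(idx: &[u8; 64], src: &[u8; 64]) -> [u8; 64] {
    let mut out = [0u8; 64];
    for i in 0..64 {
        out[i] = src[(idx[i] & 63) as usize];
    }
    out
}

/// Hypothetical layout change: index table that transposes an 8x8 byte matrix.
/// The real backend's register layout may differ; this is only a stand-in.
fn transpose_indices() -> [u8; 64] {
    let mut idx = [0u8; 64];
    for r in 0..8 {
        for c in 0..8 {
            idx[r * 8 + c] = (c * 8 + r) as u8;
        }
    }
    idx
}

fn main() {
    let idx = transpose_indices();

    // Dummy 64-byte state in the "memory" layout: byte i holds the value i.
    let mut state = [0u8; 64];
    for (i, b) in state.iter_mut().enumerate() {
        *b = i as u8;
    }

    // Permutation 1: memory layout -> register layout (on load).
    let in_regs = permute_bytes(&idx, &state);
    assert_eq!(in_regs[1], state[8]);

    // ... the rounds would operate on `in_regs` here ...

    // Permutation 2: register layout -> memory layout (on store).
    // A transpose is its own inverse, so the same table undoes it.
    let back = permute_bytes(&idx, &in_regs);
    assert_eq!(back, state);
    println!("round trip ok");
}
```

If the state were stored in memory in the register layout instead, both conversions in this sketch would disappear, which is the change proposed above.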
Performance (with -C target-cpu=native, Ryzen 9 7900X, x86_64-pc-windows-msvc):
soft backend:
test groestl256_10 ... bench: 62.62 ns/iter (+/- 1.15) = 161 MB/s
test groestl256_100 ... bench: 604.86 ns/iter (+/- 7.71) = 165 MB/s
test groestl256_1000 ... bench: 5,930.86 ns/iter (+/- 83.92) = 168 MB/s
test groestl256_10000 ... bench: 59,241.11 ns/iter (+/- 535.22) = 168 MB/s
avx512_gfni backend:
test groestl256_10 ... bench: 15.39 ns/iter (+/- 0.42) = 666 MB/s
test groestl256_100 ... bench: 148.98 ns/iter (+/- 5.03) = 675 MB/s
test groestl256_1000 ... bench: 1,402.30 ns/iter (+/- 27.58) = 713 MB/s
test groestl256_10000 ... bench: 13,936.83 ns/iter (+/- 608.29) = 717 MB/s