groestl: add AVX-512/GFNI backend

Open robbie01 opened this issue 5 months ago • 1 comments

#718

I took a conservative approach here and kept the same in-memory state representation as the original (now soft) backend. This costs two extra vpermb instructions (_mm512_permutexvar_epi8) per call to compress and p. (Note: this is not a per-block overhead, since compress now works on a slice of blocks, per @newpavlov's recommendation.)
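To illustrate why the layout conversion is a per-call rather than a per-block cost, here is a minimal scalar sketch. It is not the code in this PR: `permute` is a hypothetical stand-in (an 8x8 byte transpose) for whatever index table the real vpermb uses, and `absorb_block` is a placeholder XOR, not the Groestl compression function.

```rust
/// Groestl-256 state: 64 bytes, viewed here as an 8x8 byte matrix.
type State = [u8; 64];

/// Hypothetical stand-in for the memory <-> register shuffle: an 8x8 byte
/// transpose. The real backend would use vpermb with its own index table.
fn permute(s: &State) -> State {
    let mut out = [0u8; 64];
    for r in 0..8 {
        for c in 0..8 {
            out[c * 8 + r] = s[r * 8 + c];
        }
    }
    out
}

/// Placeholder for the per-block work done in register layout.
/// This is NOT Groestl; it only XORs the block into the state.
fn absorb_block(reg: &mut State, block: &[u8; 64]) {
    for (s, b) in reg.iter_mut().zip(block) {
        *s ^= *b;
    }
}

/// Processes a slice of blocks. The layout is converted once on entry and
/// once on exit, so the two shuffles are shared by all blocks in the call.
fn compress(state: &mut State, blocks: &[[u8; 64]]) {
    let mut reg = permute(state); // memory layout -> register layout
    for block in blocks {
        absorb_block(&mut reg, block);
    }
    *state = permute(&reg); // register layout -> memory layout
}
```

With this shape, a call that processes many blocks pays for the same two shuffles as a call that processes one, so the overhead shrinks relative to the useful work as the slice grows.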

If it's acceptable, I can modify the code to keep the state in memory in the same representation it uses in registers. That should be risk-free, since the available CPU features cannot change during execution.
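A backend-specific memory layout is only safe if every call on a given state picks the same backend. One way to make that explicit is to detect features once and cache the result, as in the rough sketch below. The `Backend` enum and both function names are hypothetical, and the exact feature set is an assumption (vpermb needs AVX512_VBMI, and the backend name suggests GFNI); the crate itself may use a different detection mechanism.

```rust
use std::sync::OnceLock;

/// Hypothetical backend selector, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Soft,
    Avx512Gfni,
}

/// Queries the CPU. The feature list is an assumption, not taken from the PR.
fn detect() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f")
            && std::arch::is_x86_feature_detected!("avx512vbmi")
            && std::arch::is_x86_feature_detected!("gfni")
        {
            return Backend::Avx512Gfni;
        }
    }
    Backend::Soft
}

/// Returns the backend chosen on first use; later calls reuse the cached
/// value, so the state layout cannot change under a live hasher.
fn backend() -> Backend {
    static BACKEND: OnceLock<Backend> = OnceLock::new();
    *BACKEND.get_or_init(detect)
}
```

Because the choice is fixed for the lifetime of the process, a state written in register layout by one call would always be read back by the same backend.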

Aug 13 '25 15:08 robbie01

Performance (with -C target-cpu=native, Ryzen 9 7900X, x86_64-pc-windows-msvc):

soft backend:

test groestl256_10    ... bench:          62.62 ns/iter (+/- 1.15) = 161 MB/s
test groestl256_100   ... bench:         604.86 ns/iter (+/- 7.71) = 165 MB/s
test groestl256_1000  ... bench:       5,930.86 ns/iter (+/- 83.92) = 168 MB/s
test groestl256_10000 ... bench:      59,241.11 ns/iter (+/- 535.22) = 168 MB/s

avx512_gfni backend:

test groestl256_10    ... bench:          15.39 ns/iter (+/- 0.42) = 666 MB/s
test groestl256_100   ... bench:         148.98 ns/iter (+/- 5.03) = 675 MB/s
test groestl256_1000  ... bench:       1,402.30 ns/iter (+/- 27.58) = 713 MB/s
test groestl256_10000 ... bench:      13,936.83 ns/iter (+/- 608.29) = 717 MB/s

Aug 13 '25 15:08 robbie01