universal-hashes icon indicating copy to clipboard operation
universal-hashes copied to clipboard

polyval, ghash: use powers of H to process multiple blocks at a time

Open ericlagergren opened this issue 11 months ago • 6 comments

We can compute h = [H^n, H^(n-1), ..., H] and then process N blocks at a time. On a 2020 M1, a stride of 8 runs at about 0.17 cycles per byte whereas a stride of 1 runs at about 1.4 cycles per byte—an ~8x improvement.

Note that sometimes this isn't desirable. For example, HCTR-2 computes POLYVAL over single blocks and the overhead of constructing plus cleaning up an N-wide POLYVAL hurt performance. So, we probably need to offer both a "wide" and a "lite" implementation.

I'm happy to donate my implementation (x86 and aarch64).

ericlagergren avatar Feb 12 '25 18:02 ericlagergren

Exciting work! The parallelism would be prospectively quite helpful for https://github.com/RustCrypto/AEADs/issues/74

It'd be great if you could open a PR. Just looking over your code there are great code comments which are much better than what we currently have.

tarcieri avatar Feb 14 '25 02:02 tarcieri

Arti would benefit from this a lot. Would it be okay if I worked on making @ericlargergren's code into a PR? (It is under a 3BSD license.)

(Alternatively I could try to do a clean-room implementation of the basic technique, which would take more time.)

nmathewson avatar May 11 '25 16:05 nmathewson

You're more than welcome to use my code. I am happy to relicense it as needed. I've been meaning to send a PR myself, but I just haven't had the time yet.

ericlagergren avatar May 11 '25 16:05 ericlagergren

Note: I'd still invite someone to adapt the soft backend from https://github.com/ericlagergren/polyval-rs/

There are possible refactoring improvements from that implementation (e.g. extracting a proper FieldElement type) which would help for more exotic use cases that perform operations over e.g. GHASH's field

tarcieri avatar May 17 '25 17:05 tarcieri

I took a look at adapting the field element multiplication from https://github.com/ericlagergren/polyval-rs/ and the result was slower than the existing BearSSL-derived implementation:

polyval-rs

test bench1_10    ... bench:          33.88 ns/iter (+/- 15.84) = 303 MB/s
test bench2_100   ... bench:         252.40 ns/iter (+/- 157.71) = 396 MB/s
test bench3_1000  ... bench:       2,288.38 ns/iter (+/- 80.40) = 437 MB/s
test bench3_10000 ... bench:      22,635.68 ns/iter (+/- 3,572.99) = 441 MB/s

original 64-bit backend

test bench1_10    ... bench:          25.36 ns/iter (+/- 1.33) = 400 MB/s
test bench2_100   ... bench:         179.73 ns/iter (+/- 22.71) = 558 MB/s
test bench3_1000  ... bench:       1,592.88 ns/iter (+/- 62.69) = 628 MB/s
test bench3_10000 ... bench:      15,819.25 ns/iter (+/- 573.98) = 632 MB/s

That said, it would still be good to adapt the powers-of-H approach as well.

tarcieri avatar Oct 15 '25 14:10 tarcieri

I'm thinking the best way to implement this is have one common implementation of Polyval which impls all the traits like UhfBackend as well as a common implementation of powers-of-H multiblock processing, and then have each backend define its own FieldElement type, so all backends can share a common high-level implementation.

Ideally we could even re-export those FieldElement types, provided they could be made to have a common API surface, for more exotic usages like #236.

Not sure why you didn't do it that way in polyval-rs @ericlagergren, though I'd be curious if you had reasons.

tarcieri avatar Oct 15 '25 14:10 tarcieri