Add optimization for Adler32 checksum for Power processors
Hi,
This PR introduces an optimization of the Adler32 checksum for POWER8+ processors that uses VSX (vector) instructions.
Suppose adler32 processes 1 byte at a time, and let _n denote the value after iteration n. s1_0 is the initial value of s1 taken from adler (1 for the default initial adler value of 1). After the first byte, s1_1 = s1_0 + c[1]; after the second, s1_2 = s1_1 + c[2], and so on. Hence, s1_N = s1_(N-1) + c[N] is the value of s1 after iteration N. Since s2 accumulates s1 at every iteration (s2_N = s2_(N-1) + s1_N), we get `s2_N = s2_0 + N*s1_0 + N*c[1] + (N-1)*c[2] + ... + c[N]`. In a more general way:
s1_N = s1_0 + sum(i=1 to N) c[i]
s2_N = s2_0 + N*s1_0 + sum(i=1 to N) (N-i+1)*c[i]
Where s1_N and s2_N are the values of s1 and s2 after N iterations. So if we can process N bytes at a time, we can update the adler32 checksum for N bytes at once. Since VSX supports 16-byte vector instructions, we can process 16 bytes at a time. With N = 16 we have:
s1 = s1_16 = s1_0 + sum(i=1 to 16) c[i]
s2 = s2_16 = s2_0 + 16*s1_0 + sum(i=1 to 16) (16-i+1)*c[i]
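To make the 16-byte update concrete, here is a minimal scalar sketch in plain C. It is not the VSX code from this PR: the function names `adler32_bytewise` and `adler32_block16` are hypothetical, and for simplicity it reduces modulo 65521 after every block, whereas a tuned implementation would typically defer the reduction. It only illustrates that the block formulas give the same result as the byte-at-a-time loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u /* largest prime smaller than 65536, the Adler-32 modulus */

/* Reference: update the checksum one byte at a time. */
static uint32_t adler32_bytewise(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

/* Illustrative scalar version of the block update with N = 16. */
static uint32_t adler32_block16(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;

    assert(buf != NULL || len == 0);

    while (len >= 16) {
        uint32_t sum = 0;  /* sum(i=1 to 16) c[i] */
        uint32_t wsum = 0; /* sum(i=1 to 16) (16-i+1)*c[i] */
        for (unsigned j = 0; j < 16; j++) {
            /* j is 0-based, so the weight (16-i+1) becomes (16-j). */
            sum += buf[j];
            wsum += (16 - j) * buf[j];
        }
        /* s2 must be updated first: it uses s1_0, the value before this block. */
        s2 = (s2 + 16 * s1 + wsum) % BASE;
        s1 = (s1 + sum) % BASE;
        buf += 16;
        len -= 16;
    }

    /* Remaining bytes (fewer than 16) are handled one at a time. */
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}
```

Calling both functions with the same initial adler value (normally 1) and the same buffer should return identical checksums. In a vectorized version the two inner sums are the part that maps onto 16-byte vector operations.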
The VSX version starts to improve performance for buffers of size >= 64. It is up to roughly 10x faster than the non-vectorized adler32 implementation (average CPU time in ns over 100000 iterations):
| buffer size (bytes) | adler32 baseline (ns) | adler32 power (ns) | speedup |
|---|---|---|---|
| 64 | 44.921875 | 41.015625 | - |
| 1024 | 943.359375 | 130.859375 | 7.2 |
| 10*5552 | 42519.531250 | 3974.609375 | 10.7 |
For buffers with length <= 64 the performance is almost the same as that of the non-vectorized implementation (with a small performance degradation in some cases):
| buffer size (bytes) | adler32 baseline (ns) | adler32 power (ns) |
|---|---|---|
| NULL | 5.859375 | 6.812500 |
| 1 | 3.906250 | 4.859375 |
| 15 | 11.718750 | 12.625000 |
| 48 | 35.156250 | 33.203125 |
FYI this PR uses the same base commit as #457 to add base code for Power optimizations. When either one gets accepted, the other can be rebased to remove the first commit from the PR.
A long time ago, I opened this ticket:
- https://github.com/madler/zlib/issues/847