Research notes, SIMD implementation, faster multiplication and `GF(2^m)`
I also found this really useful: https://www.iacr.org/workshops/ches/ches2013/presentations/CHES2013_Session6_1.pdf (although it focuses heavily on binary curves rather than prime curves)
However, it seems like it can't beat the MULX, ADOX, ADCX sequence as described by: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
And this covers squaring using MULX, ADOX and ADCX sequences: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf
This repository, together with https://github.com/cloudflare/sidh/blob/master/p503/arith_amd64.s, seems to be enough to adapt libff to the MULX + ADOX/ADCX optimisations: batch the loads, then issue a batch of interleaved ADOX/ADCX instructions for best pipeline usage.
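To make the idea concrete, here is a minimal sketch (my own, not taken from the whitepaper or from libff/barretenberg) of a 128-bit x 128-bit = 256-bit multiply in that style: MULX produces the 64x64 partial products without touching the flags, and the low/high halves are accumulated on two independent carry chains (CF via ADCX, OF via ADOX), which is what lets the hardware interleave them.

#include <immintrin.h>   // _mulx_u64 (BMI2), _addcarryx_u64 (ADX); compile with -mbmi2 -madx

// Sketch only: 2-limb x 2-limb schoolbook multiply with two carry chains.
static void mul_128x128(unsigned long long r[4],
                        const unsigned long long a[2],
                        const unsigned long long b[2])
{
    unsigned long long h00, h01, h10, h11;
    unsigned long long l00 = _mulx_u64(a[0], b[0], &h00);
    unsigned long long l01 = _mulx_u64(a[0], b[1], &h01);
    unsigned long long l10 = _mulx_u64(a[1], b[0], &h10);
    unsigned long long l11 = _mulx_u64(a[1], b[1], &h11);

    unsigned char cf = 0, of = 0;
    r[0] = l00;
    // carry chain 1 (maps to ADCX): fold in the a[1]*b[0] row
    cf = _addcarryx_u64(cf, h00, l10, &r[1]);
    cf = _addcarryx_u64(cf, h10, l11, &r[2]);
    cf = _addcarryx_u64(cf, h11, 0,   &r[3]);
    // carry chain 2 (maps to ADOX): fold in the a[0]*b[1] row
    of = _addcarryx_u64(of, r[1], l01, &r[1]);
    of = _addcarryx_u64(of, r[2], h01, &r[2]);
    of = _addcarryx_u64(of, r[3], 0,   &r[3]);
    // final carries out of r[3] are zero: the full product fits in 256 bits
}

The 4-limb (256-bit) case in the whitepaper is the same pattern with more rows, which is where batching the loads and interleaving the two chains starts to pay off.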
From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf (pages 13-15): the CLMUL instruction looks unlikely to outperform the above without a way to represent the GF(p) field over GF(2^n). I can't find an appropriate irreducible polynomial, and I don't think an arbitrary prime field can easily be represented over a binary field.
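A quick standalone check of why CLMUL on its own doesn't give us integer arithmetic: it multiplies polynomials over GF(2), i.e. without carries, so for example 3 * 3 comes out as 5 (x^2 + 1) rather than 9. This snippet is just my illustration, not from the paper:

#include <stdio.h>
#include <wmmintrin.h>   // _mm_clmulepi64_si128; compile with -mpclmul

int main(void)
{
    __m128i a = _mm_set_epi64x(0, 3);                  // the polynomial x + 1
    __m128i r = _mm_clmulepi64_si128(a, a, 0x00);      // carry-less square
    // prints "carry-less 3*3 = 5, integer 3*3 = 9"
    printf("carry-less 3*3 = %llu, integer 3*3 = %d\n",
           (unsigned long long)_mm_cvtsi128_si64(r), 3 * 3);
    return 0;
}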
You would assume that an optimising compiler could produce something similar to the following for fq_impl_int128:
Karatsuba Multiplication 128-bit x 128-bit = 256-bit
#include <wmmintrin.h>   // PCLMULQDQ intrinsics

// c1:c0 = a * b (carry-less 128x128 -> 256; Karatsuba, three CLMULs instead of four)
void GF2m_mul_2x2_xmm(__m128i *c1, __m128i *c0, __m128i b, __m128i a)
{
    __m128i t1, t2;
    *c0 = _mm_clmulepi64_si128(a, b, 0x00);   // a_lo * b_lo
    *c1 = _mm_clmulepi64_si128(a, b, 0x11);   // a_hi * b_hi
    t1 = _mm_shuffle_epi32(a, 0xEE);          // broadcast the high 64 bits of a
    t1 = _mm_xor_si128(a, t1);                // low half now holds a_lo ^ a_hi
    t2 = _mm_shuffle_epi32(b, 0xEE);          // broadcast the high 64 bits of b
    t2 = _mm_xor_si128(b, t2);                // low half now holds b_lo ^ b_hi
    t1 = _mm_clmulepi64_si128(t1, t2, 0x00);  // (a_lo ^ a_hi) * (b_lo ^ b_hi)
    t1 = _mm_xor_si128(*c0, t1);
    t1 = _mm_xor_si128(*c1, t1);              // Karatsuba middle term
    t2 = t1;
    t1 = _mm_slli_si128(t1, 8);               // middle term, low half into bits 64..127
    t2 = _mm_srli_si128(t2, 8);               // middle term, high half into bits 128..191
    *c0 = _mm_xor_si128(*c0, t1);
    *c1 = _mm_xor_si128(*c1, t2);
}
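For reference, the reason this needs only three CLMULs is the usual one-level Karatsuba identity over GF(2)[x], with a = a_hi*x^64 + a_lo: a*b = a_hi*b_hi*x^128 + ((a_lo + a_hi)(b_lo + b_hi) + a_lo*b_lo + a_hi*b_hi)*x^64 + a_lo*b_lo, where + is XOR. A call site could look like this (my own snippet; the operand variables are assumed):

__m128i a = _mm_set_epi64x(a_hi, a_lo);   // two 64-bit halves of the first operand
__m128i b = _mm_set_epi64x(b_hi, b_lo);
__m128i lo, hi;
GF2m_mul_2x2_xmm(&hi, &lo, b, a);         // 256-bit carry-less product in hi:lo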
Karatsuba Multiplication 256-bit x 256-bit = 512-bit
// c3c2c1c0 = a1a0 * b1b0 — one more Karatsuba level on top of GF2m_mul_2x2_xmm;
// a and b point to four 64-bit limbs in d[0..3]
{
    __m128i a0, a1, b0, b1, c0, c1, c2, c3, c4, c5;

    a1 = _mm_set_epi64x(a->d[3], a->d[2]);
    a0 = _mm_set_epi64x(a->d[1], a->d[0]);
    b1 = _mm_set_epi64x(b->d[3], b->d[2]);
    b0 = _mm_set_epi64x(b->d[1], b->d[0]);

    GF2m_mul_2x2_xmm(&c1, &c0, a0, b0);   // c1:c0 = a0 * b0
    GF2m_mul_2x2_xmm(&c3, &c2, a1, b1);   // c3:c2 = a1 * b1
    a0 = _mm_xor_si128(a0, a1);           // a0 + a1
    b0 = _mm_xor_si128(b0, b1);           // b0 + b1
    GF2m_mul_2x2_xmm(&c5, &c4, a0, b0);   // c5:c4 = (a0 + a1) * (b0 + b1)

    // middle term: (a0 + a1)(b0 + b1) + a0*b0 + a1*b1
    c4 = _mm_xor_si128(c4, c0);
    c4 = _mm_xor_si128(c4, c2);
    c5 = _mm_xor_si128(c5, c1);
    c5 = _mm_xor_si128(c5, c3);

    // add the middle term at the x^128 position
    c1 = _mm_xor_si128(c1, c4);
    c2 = _mm_xor_si128(c2, c5);
}
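To round it off, the 512-bit result could then be written back to an eight-limb array in the same d[] representation used for the loads above (my addition; the output struct layout is assumed):

// r->d[0..7] = c3:c2:c1:c0 (little-endian limbs, matching the _mm_set_epi64x loads)
_mm_storeu_si128((__m128i *)&r->d[0], c0);
_mm_storeu_si128((__m128i *)&r->d[2], c1);
_mm_storeu_si128((__m128i *)&r->d[4], c2);
_mm_storeu_si128((__m128i *)&r->d[6], c3);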
Quick Squaring (256-bit)^2 = 512-bit
// c3c2c1c0 = (a1a0)^2 — in GF(2)[x] the cross terms vanish (2*a0*a1 = 0),
// so each 64-bit half squares independently with a single CLMUL
// (a1:a0 loaded from the limbs exactly as in the 256-bit multiply above)
{
    __m128i c0, c1, c2, c3;

    c0 = _mm_clmulepi64_si128(a0, a0, 0x00);   // (low half of a0)^2
    c1 = _mm_clmulepi64_si128(a0, a0, 0x11);   // (high half of a0)^2
    c2 = _mm_clmulepi64_si128(a1, a1, 0x00);   // (low half of a1)^2
    c3 = _mm_clmulepi64_si128(a1, a1, 0x11);   // (high half of a1)^2
}
Heya, this is awesome research, thanks for this. That Intel paper is really handy.
Out of curiosity, are you thinking of refactoring the libff subroutines to use the ADCX/ADOX/MULX multiplication chain, or linking the subroutines used in barretenberg?
I'm thinking of refactoring the libff subroutines to use ADCX/ADOX/MULX (with load buffering, then interleaving), via a compile-time flag like HAVE_ADCX.
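Roughly the shape I have in mind for the switch (the function names here are placeholders, not the real libff routines):

#if defined(HAVE_ADCX)
    // BMI2/ADX path: MULX partial products, interleaved ADCX/ADOX carry chains
    mul_256_adx(r, a, b);
#else
    // existing portable limb multiplication
    mul_256_generic(r, a, b);
#endif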
I need some time to finish a few other things before I get onto the task, but it looks like a reasonable speedup.
Also, I think I found the slowdown with GMP on WASM: the fallback method uses the naïve modular powering method rather than Montgomery (and also hits a lot of bounds checks, etc.), so I should be able to get that sorted out too, which should reduce my WASM proving time.
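For context, the point of Montgomery form is that the reduction needs only multiplies and shifts, no division. A minimal one-word sketch of the reduction step, just to illustrate (assumes p < 2^63 so the 128-bit intermediate can't overflow, and n_inv = -p^{-1} mod 2^64):

#include <stdint.h>

// REDC: returns T * 2^-64 mod p, for T < p * 2^64.
static uint64_t redc64(unsigned __int128 T, uint64_t p, uint64_t n_inv)
{
    uint64_t m = (uint64_t)T * n_inv;                            // T * (-p^-1) mod 2^64
    unsigned __int128 t = (T + (unsigned __int128)m * p) >> 64;  // exact: low word cancels
    return (t >= p) ? (uint64_t)(t - p) : (uint64_t)t;           // at most one subtraction
}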
I will let you know when I start making progress.