Research notes, SIMD implementation, faster multiplication and `GF(2^m)`
I also found this really useful: https://www.iacr.org/workshops/ches/ches2013/presentations/CHES2013_Session6_1.pdf (although it focuses heavily on binary curves rather than prime curves)
However, it seems like it can't beat the MULX, ADOX, ADCX sequence as described by: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
And this covers squaring using MULX, ADOX and ADCX sequences: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/large-integer-squaring-ia-paper.pdf
This repository, together with https://github.com/cloudflare/sidh/blob/master/p503/arith_amd64.s, seems to be enough to adapt libff to the MULX + ADOX/ADCX optimisations: batch the loads, then issue a batch of interleaved ADOX/ADCX instructions for best pipeline usage.
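To make the idea concrete, here is a minimal sketch (my own, not taken from the whitepaper or from libff/barretenberg) of a 128-bit x 128-bit = 256-bit multiply in that style: MULX produces the 64x64 partial products without touching the flags, and the low/high halves are accumulated on two independent carry chains (CF via ADCX, OF via ADOX), which is what lets the hardware interleave them.

#include <immintrin.h>   // _mulx_u64 (BMI2), _addcarryx_u64 (ADX); compile with -mbmi2 -madx

// Sketch only: 2-limb x 2-limb schoolbook multiply with two carry chains.
static void mul_128x128(unsigned long long r[4],
                        const unsigned long long a[2],
                        const unsigned long long b[2])
{
    unsigned long long h00, h01, h10, h11;
    unsigned long long l00 = _mulx_u64(a[0], b[0], &h00);
    unsigned long long l01 = _mulx_u64(a[0], b[1], &h01);
    unsigned long long l10 = _mulx_u64(a[1], b[0], &h10);
    unsigned long long l11 = _mulx_u64(a[1], b[1], &h11);

    unsigned char cf = 0, of = 0;
    r[0] = l00;
    // carry chain 1 (maps to ADCX): fold in the a[1]*b[0] row
    cf = _addcarryx_u64(cf, h00, l10, &r[1]);
    cf = _addcarryx_u64(cf, h10, l11, &r[2]);
    cf = _addcarryx_u64(cf, h11, 0,   &r[3]);
    // carry chain 2 (maps to ADOX): fold in the a[0]*b[1] row
    of = _addcarryx_u64(of, r[1], l01, &r[1]);
    of = _addcarryx_u64(of, r[2], h01, &r[2]);
    of = _addcarryx_u64(of, r[3], 0,   &r[3]);
    // final carries out of r[3] are zero: the full product fits in 256 bits
}

The 4-limb (256-bit) case in the whitepaper is the same pattern with more rows, which is where batching the loads and interleaving the two chains starts to pay off.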
From https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf (pages 13-15): the CLMUL instruction looks unlikely to outperform the above without a way to represent the GF(p) field over GF(2^n). I can't find an appropriate irreducible polynomial, and I don't think an arbitrary prime field can easily be represented over a binary field.
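A quick standalone check of why CLMUL on its own doesn't give us integer arithmetic: it multiplies polynomials over GF(2), i.e. without carries, so for example 3 * 3 comes out as 5 (x^2 + 1) rather than 9. This snippet is just my illustration, not from the paper:

#include <stdio.h>
#include <wmmintrin.h>   // _mm_clmulepi64_si128; compile with -mpclmul

int main(void)
{
    __m128i a = _mm_set_epi64x(0, 3);                  // the polynomial x + 1
    __m128i r = _mm_clmulepi64_si128(a, a, 0x00);      // carry-less square
    // prints "carry-less 3*3 = 5, integer 3*3 = 9"
    printf("carry-less 3*3 = %llu, integer 3*3 = %d\n",
           (unsigned long long)_mm_cvtsi128_si64(r), 3 * 3);
    return 0;
}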
You would assume that an optimising compiler could produce something similar to the following for fq_impl_int128:
Karatsuba Multiplication 128-bit x 128-bit = 256-bit
#include <wmmintrin.h>   // PCLMULQDQ intrinsics

// c1:c0 = a * b (carry-less 128x128 -> 256; Karatsuba, three CLMULs instead of four)
void GF2m_mul_2x2_xmm(__m128i *c1, __m128i *c0, __m128i b, __m128i a)
{
    __m128i t1, t2;
    *c0 = _mm_clmulepi64_si128(a, b, 0x00);   // a_lo * b_lo
    *c1 = _mm_clmulepi64_si128(a, b, 0x11);   // a_hi * b_hi
    t1 = _mm_shuffle_epi32(a, 0xEE);          // broadcast the high 64 bits of a
    t1 = _mm_xor_si128(a, t1);                // low half now holds a_lo ^ a_hi
    t2 = _mm_shuffle_epi32(b, 0xEE);          // broadcast the high 64 bits of b
    t2 = _mm_xor_si128(b, t2);                // low half now holds b_lo ^ b_hi
    t1 = _mm_clmulepi64_si128(t1, t2, 0x00);  // (a_lo ^ a_hi) * (b_lo ^ b_hi)
    t1 = _mm_xor_si128(*c0, t1);
    t1 = _mm_xor_si128(*c1, t1);              // Karatsuba middle term
    t2 = t1;
    t1 = _mm_slli_si128(t1, 8);               // middle term, low half into bits 64..127
    t2 = _mm_srli_si128(t2, 8);               // middle term, high half into bits 128..191
    *c0 = _mm_xor_si128(*c0, t1);
    *c1 = _mm_xor_si128(*c1, t2);
}
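For reference, the reason this needs only three CLMULs is the usual one-level Karatsuba identity over GF(2)[x], with a = a_hi*x^64 + a_lo: a*b = a_hi*b_hi*x^128 + ((a_lo + a_hi)(b_lo + b_hi) + a_lo*b_lo + a_hi*b_hi)*x^64 + a_lo*b_lo, where + is XOR. A call site could look like this (my own snippet; the operand variables are assumed):

__m128i a = _mm_set_epi64x(a_hi, a_lo);   // two 64-bit halves of the first operand
__m128i b = _mm_set_epi64x(b_hi, b_lo);
__m128i lo, hi;
GF2m_mul_2x2_xmm(&hi, &lo, b, a);         // 256-bit carry-less product in hi:lo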
Karatsuba Multiplication 256-bit x 256-bit = 512-bit
// c3c2c1c0 = a1a0 * b1b0 — one more Karatsuba level on top of GF2m_mul_2x2_xmm;
// a and b point to four 64-bit limbs in d[0..3]
{
    __m128i a0, a1, b0, b1, c0, c1, c2, c3, c4, c5;

    a1 = _mm_set_epi64x(a->d[3], a->d[2]);
    a0 = _mm_set_epi64x(a->d[1], a->d[0]);
    b1 = _mm_set_epi64x(b->d[3], b->d[2]);
    b0 = _mm_set_epi64x(b->d[1], b->d[0]);

    GF2m_mul_2x2_xmm(&c1, &c0, a0, b0);   // c1:c0 = a0 * b0
    GF2m_mul_2x2_xmm(&c3, &c2, a1, b1);   // c3:c2 = a1 * b1
    a0 = _mm_xor_si128(a0, a1);           // a0 + a1
    b0 = _mm_xor_si128(b0, b1);           // b0 + b1
    GF2m_mul_2x2_xmm(&c5, &c4, a0, b0);   // c5:c4 = (a0 + a1) * (b0 + b1)

    // middle term: (a0 + a1)(b0 + b1) + a0*b0 + a1*b1
    c4 = _mm_xor_si128(c4, c0);
    c4 = _mm_xor_si128(c4, c2);
    c5 = _mm_xor_si128(c5, c1);
    c5 = _mm_xor_si128(c5, c3);

    // add the middle term at the x^128 position
    c1 = _mm_xor_si128(c1, c4);
    c2 = _mm_xor_si128(c2, c5);
}
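To round it off, the 512-bit result could then be written back to an eight-limb array in the same d[] representation used for the loads above (my addition; the output struct layout is assumed):

// r->d[0..7] = c3:c2:c1:c0 (little-endian limbs, matching the _mm_set_epi64x loads)
_mm_storeu_si128((__m128i *)&r->d[0], c0);
_mm_storeu_si128((__m128i *)&r->d[2], c1);
_mm_storeu_si128((__m128i *)&r->d[4], c2);
_mm_storeu_si128((__m128i *)&r->d[6], c3);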
Quick Squaring (256-bit)^2 = 512-bit
// c3c2c1c0 = (a1a0)^2 — in GF(2)[x] the cross terms vanish (2*a0*a1 = 0),
// so each 64-bit half squares independently with a single CLMUL
// (a1:a0 loaded from the limbs exactly as in the 256-bit multiply above)
{
    __m128i c0, c1, c2, c3;

    c0 = _mm_clmulepi64_si128(a0, a0, 0x00);   // (low half of a0)^2
    c1 = _mm_clmulepi64_si128(a0, a0, 0x11);   // (high half of a0)^2
    c2 = _mm_clmulepi64_si128(a1, a1, 0x00);   // (low half of a1)^2
    c3 = _mm_clmulepi64_si128(a1, a1, 0x11);   // (high half of a1)^2
}
Heya, this is awesome research, thanks for this. That Intel paper is really handy.
Out of curiosity, are you thinking of refactoring the libff subroutines to use the ADCX/ADOX/MULX multiplication chain, or linking the subroutines used in barretenberg?
I'm thinking of refactoring the libff subroutines to use ADCX/ADOX/MULX (with load buffering, then interleaving), via a compile-time flag like HAVE_ADCX.
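Roughly the shape I have in mind for the switch (the function names here are placeholders, not the real libff routines):

#if defined(HAVE_ADCX)
    // BMI2/ADX path: MULX partial products, interleaved ADCX/ADOX carry chains
    mul_256_adx(r, a, b);
#else
    // existing portable limb multiplication
    mul_256_generic(r, a, b);
#endif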
I need some time to finish a few other things before I get onto the task, but it looks like a reasonable speedup.
Also, I think I found the slowdown with GMP on WASM: the fallback method uses the naïve modular powering method rather than Montgomery (and also hits a lot of bounds checks, etc.), so I should be able to get that sorted out too, which should reduce my WASM proving time.
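For context, the point of Montgomery form is that the reduction needs only multiplies and shifts, no division. A minimal one-word sketch of the reduction step, just to illustrate (assumes p < 2^63 so the 128-bit intermediate can't overflow, and n_inv = -p^{-1} mod 2^64):

#include <stdint.h>

// REDC: returns T * 2^-64 mod p, for T < p * 2^64.
static uint64_t redc64(unsigned __int128 T, uint64_t p, uint64_t n_inv)
{
    uint64_t m = (uint64_t)T * n_inv;                            // T * (-p^-1) mod 2^64
    unsigned __int128 t = (T + (unsigned __int128)m * p) >> 64;  // exact: low word cancels
    return (t >= p) ? (uint64_t)(t - p) : (uint64_t)t;           // at most one subtraction
}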
I will let you know when I start making progress.