direct sgemm for AVX2
I've created a version of the direct sgemm code for AVX2. It shares its source with the AVX512 code, with very limited ifdefs, so both can compile from the same file. The question is whether and how to integrate it.
The performance behavior is a bit different: AVX2 has fewer and narrower registers, which means the maximum matrix size where direct mode helps is quite a bit smaller than with the AVX512 version. So the question is whether this matters enough.
Performance is roughly like this (rough numbers, so assume about 2% run-to-run variation, more than usual):
| M | N | K | cycles | improvement |
|---|---|---|---|---|
| 1 | 1 | 1 | 66.4 | 3.7x |
| 2 | 2 | 2 | 81.5 | 3.2x |
| 3 | 3 | 3 | 155.5 | 2.3x |
| 4 | 4 | 4 | 103.7 | 2.7x |
| 6 | 6 | 6 | 241.1 | 2.1x |
| 8 | 8 | 8 | 161.3 | 2.3x |
| 16 | 16 | 16 | 391.2 | 2.3x |
| 32 | 32 | 32 | 2212.4 | 2.1x |
| 48 | 48 | 48 | 7204.3 | 1.5x |
| 64 | 64 | 64 | 16412.0 | 1.5x |
| 96 | 96 | 96 | 72043.9 | -3.5% |
| 128 | 128 | 128 | 149079.5 | 1.8% |
| 192 | 192 | 192 | 469975.3 | 1.9% |
| 256 | 256 | 256 | 1086867.4 | 0.5% |
| 512 | 512 | 512 | 8421933.8 | 0.1% |
| 1024 | 1024 | 1024 | 68612844.3 | -0.2% |
| 4 | 16 | 9 | 119.5 | 2.8x |
| 64 | 128 | 192 | 115995.8 | -0.5% |
| 37 | 81 | 193 | 47368.6 | 1.2x |
| 512 | 412 | 800 | 11018306.1 | 0.3% |
| 256 | 1 | 256 | 20164.4 | 4.7x |
| 256 | 2 | 256 | 30256.8 | 3.8x |
| 256 | 4 | 256 | 59394.5 | 1.6x |
| 256 | 8 | 256 | 59772.9 | 1.7x |
| 256 | 16 | 256 | 121440.6 | 0.2% |
| 256 | 32 | 256 | 185859.3 | 0.4% |
| 256 | 64 | 256 | 314719.7 | 0.0% |
| 1 | 256 | 256 | 19268.5 | 2.9x |
| 2 | 256 | 256 | 20925.7 | 2.8x |
| 4 | 256 | 256 | 25787.2 | 2.3x |
| 8 | 256 | 256 | 52181.0 | 1.5x |
| 16 | 256 | 256 | 107479.4 | 0.9% |
| 32 | 256 | 256 | 174880.2 | 0.5% |
| 64 | 256 | 256 | 301695.4 | 1.4% |
| 256 | 256 | 1 | 34168.8 | 1.2x |
| 256 | 256 | 2 | 33140.3 | 1.4x |
| 256 | 256 | 4 | 31909.0 | 1.7x |
| 256 | 256 | 8 | 40252.4 | 1.9x |
| 256 | 256 | 16 | 108578.5 | 0.4% |
| 256 | 256 | 32 | 176255.1 | 0.4% |
| 256 | 256 | 64 | 304494.9 | -2.0% |
| 512 | 512 | 1 | 176441.9 | 1.2% |
| 35 | 8457 | 1760 | 55478699.9 | 3.5% |
There are still plenty of those CPUs around. Something like a simplistic threshold of M<=16, N<=16, K<=16?
it goes further out than that, up to M * N * K = 4 * 512 * 512
(outside of that there's still wins but also some losses)
Is "at least one of M, N, K below 64" too simplistic ?
it will be correct, but the optimization goes further out than that. the M * N * K product is the best test I've found so far
(I created a few thousand random M/N/K points, measured and tried to find a decent algorithm that gets as much of the gains as possible without reaching to where it hurts)
example from above: 256 256 8 gets almost a 2x increase.
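
For comparison, the two candidate tests discussed here could be sketched roughly as below. This is a minimal illustration, not the actual dispatch code: the function names and the `DIRECT_SGEMM_MNK_LIMIT` macro are hypothetical, and the cutoff is simply the M * N * K = 4 * 512 * 512 figure quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cutoff, taken from the figure quoted in the thread:
 * M * N * K = 4 * 512 * 512 = 1048576. */
#define DIRECT_SGEMM_MNK_LIMIT ((int64_t)4 * 512 * 512)

/* Product test: pick the direct kernel when the total amount of work
 * (M * N * K) is at or below the cutoff. The early exits keep the
 * intermediate products small so the multiplication cannot overflow. */
static bool use_direct_sgemm_mnk(int64_t m, int64_t n, int64_t k)
{
    if (m <= 0 || n <= 0 || k <= 0)
        return false;
    if (m > DIRECT_SGEMM_MNK_LIMIT || n > DIRECT_SGEMM_MNK_LIMIT ||
        k > DIRECT_SGEMM_MNK_LIMIT)
        return false;
    if (m * n > DIRECT_SGEMM_MNK_LIMIT)
        return false;
    return m * n * k <= DIRECT_SGEMM_MNK_LIMIT;
}

/* Simpler alternative from the thread: at least one of M, N, K
 * is below 64. */
static bool use_direct_sgemm_simple(int64_t m, int64_t n, int64_t k)
{
    if (m <= 0 || n <= 0 || k <= 0)
        return false;
    return m < 64 || n < 64 || k < 64;
}
```

With these definitions, 256 x 256 x 8 passes both tests, 64 x 64 x 64 passes only the product test, and 35 x 8457 x 1760 passes only the simple test.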