
direct sgemm for AVX2

Open fenrus75 opened this issue 7 years ago • 6 comments

I've created a version of the direct sgemm code for AVX2 (it shares its source with the AVX512 code, with very limited ifdefs, so both can compile from the same file). The question is whether and how to integrate it.

The performance behavior is a bit different: AVX2 has fewer and narrower registers, which means the maximum matrix size where the direct mode helps is quite a bit smaller than with the AVX512 version. So the question is whether this matters enough.

Performance is roughly like this (rough numbers, so assume 2%-ish run-to-run variation, more than usual):

M N K cycles improvement
1 1 1 66.4 3.7x
2 2 2 81.5 3.2x
3 3 3 155.5 2.3x
4 4 4 103.7 2.7x
6 6 6 241.1 2.1x
8 8 8 161.3 2.3x
16 16 16 391.2 2.3x
32 32 32 2212.4 2.1x
48 48 48 7204.3 1.5x
64 64 64 16412.0 1.5x
96 96 96 72043.9 -3.5%
128 128 128 149079.5 1.8%
192 192 192 469975.3 1.9%
256 256 256 1086867.4 0.5%
512 512 512 8421933.8 0.1%
1024 1024 1024 68612844.3 -0.2%
4 16 9 119.5 2.8x
64 128 192 115995.8 -0.5%
37 81 193 47368.6 1.2x
512 412 800 11018306.1 0.3%
256 1 256 20164.4 4.7x
256 2 256 30256.8 3.8x
256 4 256 59394.5 1.6x
256 8 256 59772.9 1.7x
256 16 256 121440.6 0.2%
256 32 256 185859.3 0.4%
256 64 256 314719.7 0.0%
1 256 256 19268.5 2.9x
2 256 256 20925.7 2.8x
4 256 256 25787.2 2.3x
8 256 256 52181.0 1.5x
16 256 256 107479.4 0.9%
32 256 256 174880.2 0.5%
64 256 256 301695.4 1.4%
256 256 1 34168.8 1.2x
256 256 2 33140.3 1.4x
256 256 4 31909.0 1.7x
256 256 8 40252.4 1.9x
256 256 16 108578.5 0.4%
256 256 32 176255.1 0.4%
256 256 64 304494.9 -2.0%
512 512 1 176441.9 1.2%
35 8457 1760 55478699.9 3.5%

fenrus75 avatar Dec 26 '18 22:12 fenrus75

There are still plenty of those CPUs around. Something like a simplistic threshold of M<=16, N<=16, K<=16?

brada4 avatar Dec 27 '18 07:12 brada4

it goes further out than that, up to M * N * K = 4 * 512 * 512

(outside of that there's still wins but also some losses)

fenrus75 avatar Dec 27 '18 14:12 fenrus75
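
The cutoff described in the comment above can be sketched as a small predicate. This is only an illustration, not the code from the patch: the function name `use_direct_sgemm` and the macro `DIRECT_SGEMM_MAX_MNK` are made up here, and treating the limit as inclusive is an assumption, since the thread only says "up to".

```c
#include <stdint.h>

/* Hypothetical limit, taken from the figure quoted in the thread:
 * M * N * K = 4 * 512 * 512. */
#define DIRECT_SGEMM_MAX_MNK ((int64_t)4 * 512 * 512)

/* Returns 1 if the direct kernel looks worthwhile for this problem size,
 * 0 if the regular sgemm path should be used instead.
 * Assumption: the limit itself is still inside the "direct" region. */
int use_direct_sgemm(int64_t m, int64_t n, int64_t k)
{
    if (m <= 0 || n <= 0 || k <= 0)
        return 0;

    /* Test by division so that huge dimensions cannot overflow the product. */
    if (n > DIRECT_SGEMM_MAX_MNK / m)
        return 0;                      /* m * n alone already exceeds the limit */

    int64_t mn = m * n;                /* safe: m * n <= DIRECT_SGEMM_MAX_MNK */
    if (k > DIRECT_SGEMM_MAX_MNK / mn)
        return 0;

    return 1;
}
```

A caller would check `use_direct_sgemm(M, N, K)` before dispatching, so that sizes such as 256 256 8 take the direct path while 1024 1024 1024 stays on the regular one.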

Is "at least one of M, N, K below 64" too simplistic?

martin-frbg avatar Dec 27 '18 18:12 martin-frbg

it will be correct, but the optimization goes further out than that. The M * N * K product is the best test I've found so far.

(I generated a few thousand random M/N/K points, measured them, and tried to find a decent algorithm that captures as much of the gain as possible without reaching into the region where it hurts)

example from above: 256 256 8 gets almost a 2x increase.

fenrus75 avatar Dec 27 '18 18:12 fenrus75
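
To make the comparison in this thread concrete, here is a rough sketch that scores the three proposed tests against a handful of rows copied from the table in the first post. Everything about the harness is hypothetical: the names are invented, the first test reads the suggested threshold as M, N and K all at most 16, a "2.3x" entry is stored as 2.3, and a "-3.5%" entry is stored as 0.965. It only illustrates the kind of evaluation described above, not the actual experiment.

```c
#include <stddef.h>

/* One measured point: dimensions plus speedup of direct over regular sgemm.
 * Assumption: "Nx" entries are stored as N, "P%" entries as 1 + P/100. */
struct sample { int m, n, k; double speedup; };

/* Rows copied from the table in the first post. */
static const struct sample samples[] = {
    {    8,    8,    8, 2.3   },
    {   64,   64,   64, 1.5   },
    {  256,  256,    8, 1.9   },
    {  256,    1,  256, 4.7   },
    {   96,   96,   96, 0.965 },
    {  256,   16,  256, 1.002 },
    { 1024, 1024, 1024, 0.998 },
    {   37,   81,  193, 1.2   },
};

typedef int (*heuristic_fn)(int m, int n, int k);

/* "M, N and K all at most 16". */
int all_at_most_16(int m, int n, int k) { return m <= 16 && n <= 16 && k <= 16; }

/* "At least one of M, N, K below 64". */
int any_below_64(int m, int n, int k) { return m < 64 || n < 64 || k < 64; }

/* "M * N * K up to 4 * 512 * 512" (limit assumed inclusive). */
int mnk_product(int m, int n, int k) { return (long long)m * n * k <= 4LL * 512 * 512; }

/* Number of samples with speedup >= min_speedup that the heuristic selects. */
int count_wins(heuristic_fn use_direct, double min_speedup)
{
    int hits = 0;
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (samples[i].speedup >= min_speedup &&
            use_direct(samples[i].m, samples[i].n, samples[i].k))
            hits++;
    return hits;
}

/* Number of samples that measured slower (speedup < 1) yet are still selected. */
int count_losses(heuristic_fn use_direct)
{
    int hits = 0;
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        if (samples[i].speedup < 1.0 &&
            use_direct(samples[i].m, samples[i].n, samples[i].k))
            hits++;
    return hits;
}
```

On these eight rows, `count_wins(mnk_product, 1.5)` picks up all four points at 1.5x or better, where the other two tests miss 64 64 64. The product test is also the only one that selects a row that measured slower, 96 96 96. That loss is close to the stated run-to-run variation, and with so few rows the counts say little on their own.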