OpenBLAS direct sgemm for AVX2

I've created a version of the direct sgemm code for AVX2. It is shared with the AVX512 code with very limited ifdefs, so both can compile from the same source. The question is whether and how to integrate it.

The performance behavior is a bit different: AVX2 has fewer and narrower registers, which means the maximum matrix size where the direct mode helps is quite a bit smaller than with the AVX512 version. So the question is whether this matters enough.

Performance is roughly as follows (rough numbers, so assume about 2% run-to-run variation, more than usual):

M	N	K	cycles	improvement
1	1	1	66.4	3.7x
2	2	2	81.5	3.2x
3	3	3	155.5	2.3x
4	4	4	103.7	2.7x
6	6	6	241.1	2.1x
8	8	8	161.3	2.3x
16	16	16	391.2	2.3x
32	32	32	2212.4	2.1x
48	48	48	7204.3	1.5x
64	64	64	16412.0	1.5x
96	96	96	72043.9	-3.5%
128	128	128	149079.5	1.8%
192	192	192	469975.3	1.9%
256	256	256	1086867.4	0.5%
512	512	512	8421933.8	0.1%
1024	1024	1024	68612844.3	-0.2%
4	16	9	119.5	2.8x
64	128	192	115995.8	-0.5%
37	81	193	47368.6	1.2x
512	412	800	11018306.1	0.3%
256	1	256	20164.4	4.7x
256	2	256	30256.8	3.8x
256	4	256	59394.5	1.6x
256	8	256	59772.9	1.7x
256	16	256	121440.6	0.2%
256	32	256	185859.3	0.4%
256	64	256	314719.7	0.0%
1	256	256	19268.5	2.9x
2	256	256	20925.7	2.8x
4	256	256	25787.2	2.3x
8	256	256	52181.0	1.5x
16	256	256	107479.4	0.9%
32	256	256	174880.2	0.5%
64	256	256	301695.4	1.4%
256	256	1	34168.8	1.2x
256	256	2	33140.3	1.4x
256	256	4	31909.0	1.7x
256	256	8	40252.4	1.9x
256	256	16	108578.5	0.4%
256	256	32	176255.1	0.4%
256	256	64	304494.9	-2.0%
512	512	1	176441.9	1.2%
35	8457	1760	55478699.9	3.5%

Dec 26 '18 22:12 fenrus75

There are still plenty of those CPUs around. Something like a simplistic threshold M<=16 N<=16 K<=16 ?

Dec 27 '18 07:12 brada4

it goes further out than that, up to M * N * K = 4 * 512 * 512

(outside of that there are still wins, but also some losses)
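
As a rough illustration of that product test, here is a minimal sketch in C. The function name `direct_sgemm_worthwhile` and the macro `DIRECT_SGEMM_MAX_MNK` are made up for this example and are not OpenBLAS identifiers; the cutoff is simply the 4 * 512 * 512 figure above, and the inclusive comparison is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cutoff, taken from the 4 * 512 * 512 figure in this thread. */
#define DIRECT_SGEMM_MAX_MNK ((int64_t)4 * 512 * 512)

/* Hypothetical helper: true when M * N * K is at or below the cutoff. */
static bool direct_sgemm_worthwhile(int64_t m, int64_t n, int64_t k)
{
    if (m <= 0 || n <= 0 || k <= 0)
        return false;

    /* Divide instead of multiplying so huge dimensions cannot overflow. */
    if (m > DIRECT_SGEMM_MAX_MNK / n)
        return false;               /* m * n alone already exceeds the cutoff */

    return m * n <= DIRECT_SGEMM_MAX_MNK / k;
}
```

A caller would check this once per sgemm call and fall back to the regular kernel when it returns false.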

Dec 27 '18 14:12 fenrus75

Is "at least one of M, N, K below 64" too simplistic ?

Dec 27 '18 18:12 martin-frbg

it will be correct, but the optimization goes further out than that. the M * N * K product is the best test I've found so far

(I created a few thousand random M/N/K points, measured and tried to find a decent algorithm that gets as much of the gains as possible without reaching to where it hurts)

example from above: 256 256 8 gets almost a 2x increase.
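
To make the difference between the three rules discussed in this thread concrete, here is a small sketch that writes each one as a predicate. All three function names are hypothetical and exist only for this comparison. The product rule uses a plain multiplication for brevity, so it assumes dimensions small enough not to overflow a 64-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rule 1 (hypothetical name): every dimension at most 16. */
static bool rule_all_le_16(int64_t m, int64_t n, int64_t k)
{
    return m <= 16 && n <= 16 && k <= 16;
}

/* Rule 2 (hypothetical name): at least one dimension below 64. */
static bool rule_any_below_64(int64_t m, int64_t n, int64_t k)
{
    return m < 64 || n < 64 || k < 64;
}

/* Rule 3 (hypothetical name): M * N * K at most 4 * 512 * 512.
 * No overflow guard here; assumes moderately sized dimensions. */
static bool rule_mnk_product(int64_t m, int64_t n, int64_t k)
{
    return m * n * k <= (int64_t)4 * 512 * 512;
}
```

For the 256 256 8 case, rule 1 rejects it while rules 2 and 3 accept it. For 64 64 64, which shows a 1.5x gain in the table, only rule 3 accepts it.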

Dec 27 '18 18:12 fenrus75