Using Mat-Mul Instructions, like Arm SME and Intel AMX
Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.
Let's say our hardware supports fast $16 \times 16$ matrix multiplications with a single instruction. We can reshape the input array of length $N$ into a matrix of $16$ rows and $N/16$ columns, slide a tiled matrix-multiplication instruction across that wide matrix, multiplying each $16 \times 16$ tile by a $16$-element vector of ones and accumulating into $16$ partial sums, which the second stage reduces to the final scalar.
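Before reaching for intrinsics, the same pipeline can be written as a few scalar loops. This is only a sketch to pin down the data movement, with a made-up `reduce_as_matvec` name and the simplifying assumption that $N$ is a multiple of $256$; the innermost tile-times-ones loop is what a single SME/AMX tile instruction would replace.

```cpp
#include <cstddef>

// Stage 1: view the N inputs as a 16 x (N/16) row-major matrix and multiply
// every 16x16 tile by a vector of ones, accumulating into 16 partial sums.
// Stage 2: collapse those 16 partials into the final scalar.
float reduce_as_matvec(float const *data, std::size_t n) {
    float const ones[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    float partial[16] = {0.0f};
    std::size_t cols = n / 16; // width of the reshaped 16-row matrix
    for (std::size_t tile = 0; tile < cols; tile += 16) // slide across the wide matrix
        for (std::size_t row = 0; row < 16; ++row)
            for (std::size_t col = 0; col < 16; ++col)
                // One 16x16 tile times the ones vector; this is the loop a
                // single SME/AMX tile instruction would replace.
                partial[row] += data[row * cols + tile + col] * ones[col];
    float total = 0.0f;
    for (std::size_t row = 0; row < 16; ++row) total += partial[row]; // second stage
    return total;
}
```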
In reality, we can't use Intel AMX with float32 inputs, but we can use Arm SME, and later apply similar techniques in SimSIMD.
@ashvardanian I’m picking this up.
@ashvardanian Apparently, neither AWS Graviton 4 nor its GCP analogs support SME, so this task might face significant delays.
Running `cat /proc/cpuinfo` shows support only for `sve` and `sve2`.
This can be verified directly with clang 18.1.3: none of these instructions work:
```c
// Streaming mode
asm("smstart SM");
asm("smstop SM");
// ZA storage
asm("smstart ZA");
asm("smstop ZA");
```
According to this paper, https://arxiv.org/pdf/2409.18779, Apple's M4 chip is the first to support SME.
Then we are out of luck for now, @alexbarev. Let's wait for the next generation of CPUs.
Just wanted to link the Linux kernel docs on SME for future use 🤗
It would be interesting to see if we can beat CUB with WMMA instructions and async loads. The memory system throughput there is 4.8 TB/s, but CUB only reaches 3.4 TB/s. Since WMMA has no F32 input variant, the TF32 and F64 ones can be used instead.
CUDA C++ and PTX examples can be found in the less_slow.cpp repository 🤗
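Here is a rough single-warp sketch of that idea in CUDA C++, using the TF32 m16n16k8 WMMA shape (the kernel name and tile layout are made up for illustration, it needs sm_80 or newer, and it is nowhere near a tuned CUB competitor):

```cuda
#include <mma.h>
using namespace nvcuda;

// `input` is viewed as `n_tiles` contiguous 16x8 row-major tiles of floats;
// `partials` must hold 16x16 floats, and its first column ends up with the
// 16 per-row partial sums. Launch with a single warp: <<<1, 32>>>.
__global__ void tile_sums_tf32(float const *input, float *partials, int n_tiles) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> ones;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::fill_fragment(ones, wmma::__float_to_tf32(1.0f)); // the 8x16 "vector" of ones

    for (int t = 0; t < n_tiles; ++t) {
        wmma::load_matrix_sync(a, input + t * 16 * 8, 8 /* leading dimension */);
        // Round the loaded FP32 values to TF32 before the MMA.
        for (int i = 0; i < a.num_elements; ++i) a.x[i] = wmma::__float_to_tf32(a.x[i]);
        // acc[i][j] += sum_k a[i][k]: per-row sums, replicated across all 16 columns.
        wmma::mma_sync(acc, a, ones, acc);
    }
    wmma::store_matrix_sync(partials, acc, 16, wmma::mem_row_major);
}
```

A real contender would tile across the whole grid, stage tiles through shared memory with `cp.async`, and collapse the per-warp partials in a second pass.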