Using Mat-Mul Instructions, like Arm SME and Intel AMX
Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.
Let's say our hardware supports fast $16 \times 16$ matrix multiplications with a single instruction. We can reshape the input array of length $N$ into a matrix of $16$ rows and $N/16$ columns, slide a tiled matrix-multiplication instruction across that wide matrix, multiplying each $16 \times 16$ tile by a $16$-element vector of ones and accumulating into $16$ partial sums, which the second stage reduces to the final scalar.
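Before reaching for intrinsics, the same pipeline can be written as a few scalar loops. This is only a sketch to pin down the data movement, with a made-up `reduce_as_matvec` name and the simplifying assumption that $N$ is a multiple of $256$; the innermost tile-times-ones loop is what a single SME/AMX tile instruction would replace.

```cpp
#include <cstddef>

// Stage 1: view the N inputs as a 16 x (N/16) row-major matrix and multiply
// every 16x16 tile by a vector of ones, accumulating into 16 partial sums.
// Stage 2: collapse those 16 partials into the final scalar.
float reduce_as_matvec(float const *data, std::size_t n) {
    float const ones[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    float partial[16] = {0.0f};
    std::size_t cols = n / 16; // width of the reshaped 16-row matrix
    for (std::size_t tile = 0; tile < cols; tile += 16) // slide across the wide matrix
        for (std::size_t row = 0; row < 16; ++row)
            for (std::size_t col = 0; col < 16; ++col)
                // One 16x16 tile times the ones vector; this is the loop a
                // single SME/AMX tile instruction would replace.
                partial[row] += data[row * cols + tile + col] * ones[col];
    float total = 0.0f;
    for (std::size_t row = 0; row < 16; ++row) total += partial[row]; // second stage
    return total;
}
```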
In reality, we can't use Intel AMX with float32 inputs, but we can use Arm SME, and later apply similar techniques in SimSIMD.
@ashvardanian I’m picking this up.
@ashvardanian Apparently, neither AWS Graviton 4 nor its GCP analogs support SME, so this task might face significant delays.
Running `cat /proc/cpuinfo` shows support only for `sve` and `sve2`.
This can be verified directly with clang 18.1.3: none of these instructions work:
```c
// Streaming mode
asm("smstart SM");
asm("smstop SM");
// ZA storage
asm("smstart ZA");
asm("smstop ZA");
```
According to this paper, https://arxiv.org/pdf/2409.18779, Apple's M4 chip is the first to support SME.
Then we are out of luck for now, @alexbarev. Let's wait for the next generation of CPUs.
Just wanted to link the Linux kernel docs on SME for future use 🤗
It would be interesting to see if we can beat CUB with WMMA instructions and async loads. The memory system throughput there is 4.8 TB/s, but CUB only reaches 3.4 TB/s. Since WMMA has no F32 input variant, the TF32 and F64 ones can be used instead.
CUDA C++ and PTX examples can be found in the less_slow.cpp repository 🤗
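Here is a rough single-warp sketch of that idea in CUDA C++, using the TF32 m16n16k8 WMMA shape (the kernel name and tile layout are made up for illustration, it needs sm_80 or newer, and it is nowhere near a tuned CUB competitor):

```cuda
#include <mma.h>
using namespace nvcuda;

// `input` is viewed as `n_tiles` contiguous 16x8 row-major tiles of floats;
// `partials` must hold 16x16 floats, and its first column ends up with the
// 16 per-row partial sums. Launch with a single warp: <<<1, 32>>>.
__global__ void tile_sums_tf32(float const *input, float *partials, int n_tiles) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32, wmma::row_major> ones;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::fill_fragment(ones, wmma::__float_to_tf32(1.0f)); // the 8x16 "vector" of ones

    for (int t = 0; t < n_tiles; ++t) {
        wmma::load_matrix_sync(a, input + t * 16 * 8, 8 /* leading dimension */);
        // Round the loaded FP32 values to TF32 before the MMA.
        for (int i = 0; i < a.num_elements; ++i) a.x[i] = wmma::__float_to_tf32(a.x[i]);
        // acc[i][j] += sum_k a[i][k]: per-row sums, replicated across all 16 columns.
        wmma::mma_sync(acc, a, ones, acc);
    }
    wmma::store_matrix_sync(partials, acc, 16, wmma::mem_row_major);
}
```

A real contender would tile across the whole grid, stage tiles through shared memory with `cp.async`, and collapse the per-warp partials in a second pass.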