ParallelReductionsBenchmark

Using Mat-Mul Instructions, like Arm SME and Intel AMX

Open ashvardanian opened this issue 1 year ago • 5 comments

Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones.

Let's say our hardware supports fast $16 \times 16$ matrix multiplications with a single instruction. We can reshape the input array of length $N$ into a matrix of $16$ rows and $N/16$ columns, then slide a tiled matrix-multiplication instruction across that wide matrix one $16 \times 16$ tile at a time, multiplying each tile by a $16$-element vector of ones and accumulating the results into $16$ floats. The second stage sums those $16$ partial results into the final scalar.
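A minimal scalar sketch of that two-stage idea in C++, with no SME or AMX intrinsics involved. The names `tile_side`, `tile_times_ones`, and `reduce_with_tiles` are hypothetical, and for simplicity each tile is taken from 256 contiguous elements rather than from a strided $16 \times N/16$ view; the final sum is the same either way, up to floating-point rounding order. On real hardware the inner double loop would be replaced by a single tile instruction.

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

// Assumed tile side, matching the 16x16 example in the text.
constexpr std::size_t tile_side = 16;

// Stage 1: emulates one tile operation, `accumulators += tile * ones`.
// `tile` points at 16x16 floats stored in row-major order.
inline void tile_times_ones(float const *tile, std::array<float, tile_side> &accumulators) {
    for (std::size_t row = 0; row != tile_side; ++row)
        for (std::size_t col = 0; col != tile_side; ++col)
            accumulators[row] += tile[row * tile_side + col] * 1.0f; // multiply by the ones vector
}

inline float reduce_with_tiles(std::vector<float> const &input) {
    constexpr std::size_t tile_size = tile_side * tile_side;
    std::array<float, tile_side> accumulators{}; // 16 partial sums, zero-initialized

    // Slide over all complete tiles, accumulating 16 row sums.
    std::size_t const full_tiles = input.size() / tile_size;
    for (std::size_t t = 0; t != full_tiles; ++t)
        tile_times_ones(input.data() + t * tile_size, accumulators);

    // Stage 2: fold the 16 partial sums into one scalar.
    float sum = std::accumulate(accumulators.begin(), accumulators.end(), 0.0f);

    // Elements that do not fill a whole tile are added with a plain loop.
    for (std::size_t i = full_tiles * tile_size; i != input.size(); ++i)
        sum += input[i];
    return sum;
}
```

Calling `reduce_with_tiles` on a `std::vector<float>` returns the same value as a serial sum whenever the intermediate sums are exactly representable; otherwise the result can differ slightly because the additions happen in a different order.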

In reality, we can't use Intel AMX with float32 inputs, but we can use Arm SME, and later apply similar techniques to SimSIMD.

ashvardanian avatar Dec 30 '24 20:12 ashvardanian

@ashvardanian I’m picking this up.

alexbarev avatar Dec 30 '24 20:12 alexbarev

@ashvardanian Apparently, neither AWS Graviton 4 nor GCS analogs support SME, so this task might face significant delays.

Running cat /proc/cpuinfo shows support only for sve and sve2.
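The same check can be scripted. Below is a rough C++ sketch that scans cpuinfo-style text for a feature flag. It assumes the Linux arm64 layout, where flags are listed space-separated on lines starting with `Features`, and that SME would be reported under the name `sme`. The helper names `cpuinfo_has_feature` and `host_has_feature` are hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word on any "Features" line.
// Assumes the Linux arm64 /proc/cpuinfo layout: "Features : fp asimd ...".
inline bool cpuinfo_has_feature(std::istream &cpuinfo, std::string const &flag) {
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("Features", 0) != 0) continue; // line must start with "Features"
        std::size_t const colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream words(line.substr(colon + 1));
        std::string word;
        while (words >> word)
            if (word == flag) return true; // exact match, so "sve" does not match "sve2"
    }
    return false;
}

// Convenience wrapper for the current machine; returns false if the file is missing.
inline bool host_has_feature(std::string const &flag) {
    std::ifstream cpuinfo("/proc/cpuinfo");
    return cpuinfo.is_open() && cpuinfo_has_feature(cpuinfo, flag);
}
```

With this, `host_has_feature("sve2")` and `host_has_feature("sme")` can be called at startup to pick a code path; whole-word matching matters because several flag names share prefixes.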

This can also be verified directly: none of the following instructions works when built with clang 18.1.3:

// Streaming mode
asm("smstart SM"); 
asm("smstop SM");

// ZA storage
asm("smstart ZA");
asm("smstop ZA");

According to https://arxiv.org/pdf/2409.18779, the Apple M4 is the first chip to support SME.

alexbarev avatar Jan 05 '25 15:01 alexbarev

Then we are out of luck for now, @alexbarev. Let's wait for the next-gen CPUs.

ashvardanian avatar Jan 05 '25 15:01 ashvardanian

Just wanted to link Linux kernel docs on SME for future use 🤗

ashvardanian avatar Jan 13 '25 10:01 ashvardanian

It would be interesting to see if we can beat CUB with WMMA instructions and async loads. The memory system throughput there is 4.8 TB/s, but CUB only reaches 3.4 TB/s. In the absence of F32 mat-muls, the TF32 and F64 variants can be used.

CUDA C++ and PTX examples can be found in the less_slow.cpp repository 🤗

ashvardanian avatar Feb 12 '25 22:02 ashvardanian