OpenBLAS Request: ARM SME support (for Apple M4)

No need for the unofficial Apple AMX instruction set on M4. 2 TFLOPS possible.

May 23 '24 15:05 oscarbg

PRs welcome... do you have the hardware to test ?

May 23 '24 16:05 martin-frbg

Not yet. Waiting for a Mac mini M4.

May 24 '24 00:05 oscarbg

I think this can be closed: there is zero chance of running your cpuid on an iPad, and normal computers are released a year later. AMX is NOT an ISA; it is a co-processor with prefixed instructions emitted from the main CPU, like the FPU on the 80386 or crypto accelerators nowadays. There is no public documentation apart from Accelerate's CBLAS using it.

May 24 '24 14:05 brada4

try reading that again, it's about SME...

May 24 '24 14:05 martin-frbg

SME

Which is rumoured on some sites to be present....

May 24 '24 18:05 brada4

You can develop and test using the Fixed Virtual Platform (FVP): https://github.com/apache/tvm/pull/16755 https://github.com/apache/tvm/pull/16749

May 25 '24 18:05 Mousius

GCC 11+ can compile it; the question is whether it is supported on a particular CPU.
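The "is it supported on this CPU" question can also be asked at run time. The following is only a sketch: the function name cpu_has_sme is hypothetical, and the sysctl key hw.optional.arm.FEAT_SME is an assumption about what recent macOS versions expose, not something confirmed in this thread. On non-Apple builds it simply reports 0.

```c
#include <stddef.h>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Hypothetical helper: returns 1 if the OS reports SME, 0 otherwise.
 * Non-Apple platforms are out of scope here and always return 0. */
int cpu_has_sme(void) {
#if defined(__APPLE__)
    int value = 0;
    size_t len = sizeof(value);
    /* Assumed key name; if the key does not exist, sysctlbyname
     * returns -1 and we treat SME as absent. */
    if (sysctlbyname("hw.optional.arm.FEAT_SME", &value, &len, NULL, 0) != 0)
        return 0;
    return value != 0;
#else
    return 0;
#endif
}
```

A caller would check cpu_has_sme() once at startup and only then take an SME code path.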

May 25 '24 20:05 brada4

some implementation hints there: https://scalable.uni-jena.de/opt/sme/index.html

Aug 16 '24 07:08 martin-frbg

Hi, any news? I have a Mac mini M4 to test.

Nov 22 '24 07:11 oscarbg

Bought an M4 mini myself recently but have not gotten around to doing much with it yet.

Nov 22 '24 10:11 martin-frbg

#5084 added SME for the "small matrix" SGEMM pathway but needs some small tweaks to connect the M4 cpu target to it

#5011 has a more general SME GEMM kernel but needs fixes for proper SYMM/TRMM support before it can be merged

Feb 22 '25 22:02 martin-frbg

Based on the SC24 workshop "Hello SME", https://github.com/llvm/llvm-project/issues/114987 and https://github.com/llvm/llvm-project/pull/95478 , Apple M4 does not support SVE outside of streaming mode. However, the current [WIP] https://github.com/OpenMathLib/OpenBLAS/pull/5011 is built on top of KERNEL.ARMV8SVE, which results in an illegal instruction. Any good ideas for solving that? Would creating a new KERNEL.M4SME2 based on KERNEL.ARMV8 be a good idea?

Mar 27 '25 08:03 ITCJ

Furthermore, I recently ran some tests on the differences between SME1 and SME2. Getting the best performance out of each takes quite a different approach. I don't know whether ACLE can fully utilize these resources.

Mar 27 '25 09:03 ITCJ

Yes, M4 only does streaming SVE so you'd need at least some setup code to enter streaming mode and perhaps save some dual-use registers beforehand, or even work in a totally different set of registers than what the existing SVE code uses.

Both #5011 and #5084 introduced an ARMV9SME target for differentiation. It would also be possible to select kernel implementations (either at the KERNEL file level or within individual implementations) based on HAVE_SME or a similar define. As #5011 is a WIP only concerned with GEMM and related functions, it does not work outside its narrow scope.
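Selecting "within individual implementations" on a define could look roughly like the sketch below. Only the HAVE_SME name comes from the discussion; the function names sdot_generic, sdot_sme_placeholder and sdot_select are hypothetical, and the SME branch is a stand-in that just calls the generic code rather than a real kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Portable reference dot product (hypothetical name). */
static float sdot_generic(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

#if defined(HAVE_SME)
/* Placeholder only: a real SME kernel would replace this body. */
static float sdot_sme_placeholder(const float *x, const float *y, size_t n) {
    return sdot_generic(x, y, n);
}
#endif

/* Compile-time selection on HAVE_SME, the define named in the thread. */
float sdot_select(const float *x, const float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(HAVE_SME)
    return sdot_sme_placeholder(x, y, n);
#else
    return sdot_generic(x, y, n);
#endif
}
```

The same preprocessor test could instead sit at the KERNEL file level, choosing which source file gets built at all.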

The way forward - at least short-term - should be to split out M4 from the general "VORTEX" target into its own designation and enable the SME-based "small gemm" pathway for it. I hope to complete this very soon.

Mar 27 '25 21:03 martin-frbg

Hi, I would like to ask why I encountered the following error on an M4 Pro:

Is it possible that my compiler does not recognize streaming flags?

Compilation: clang -g -O0 -march=armv9.2-a+sme+sme2 ./test_sme_acle.cc -o ./test_sme_acle

Clang version:
Homebrew clang version 20.1.2
Target: arm64-apple-darwin24.3.0
Thread model: posix
InstalledDir: /opt/homebrew/Cellar/llvm/20.1.2/bin

Apr 10 '25 10:04 violet73

Looks like the same problem mentioned above. Try using disassemble --mixed to show the illegal instruction.

Apr 10 '25 13:04 ITCJ

Thank you for your kind reply. I disassembled it in lldb and the illegal instruction turns out to be cntd!

That means the streaming flags make the compiler emit some SVE instructions outside streaming mode, where they are illegal.

This is somewhat weird, because I can't manually set the streaming mode before main is called.

So I removed the streaming flags from main and moved the SVE code into another function, foo, with the local streaming flags.

I also manually placed the call to foo between smstart and smstop.

After these changes, the code finally ran normally!
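A rough sketch of that layout follows. It assumes the ACLE keyword __arm_locally_streaming and the feature macros __ARM_FEATURE_SME and __ARM_FEATURE_LOCALLY_STREAMING are provided by the compiler; the LOCALLY_STREAMING macro and call_foo are hypothetical names. It is a simplification that relies on the attribute alone rather than on manual smstart/smstop, the body is plain C rather than real SVE code, and it has not been verified on M4 hardware here. Without SME support the attribute compiles away and foo is an ordinary function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: both macros are defined when the compiler targets SME
 * and understands the locally-streaming keyword. */
#if defined(__ARM_FEATURE_SME) && defined(__ARM_FEATURE_LOCALLY_STREAMING)
#define LOCALLY_STREAMING __arm_locally_streaming
#else
#define LOCALLY_STREAMING /* no SME: plain function */
#endif

/* foo carries the local streaming attribute, so the mode switch is
 * the compiler's job on entry and exit, not the caller's. */
LOCALLY_STREAMING float foo(const float *x, size_t n) {
    assert(n == 0 || x != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* The caller stays non-streaming, like main in the comment above. */
float call_foo(const float *x, size_t n) {
    return foo(x, n);
}
```

The point of the split is that the entry function never carries a streaming attribute, since whatever calls main has not entered streaming mode.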

Apr 10 '25 13:04 violet73

Congrats, I also tried to resolve similar issues. I encountered ADVL during unit tests. Want to connect on WeChat? I sent the code to your email.

Apr 10 '25 13:04 ITCJ

the fast path for small matrix sgemm should be working on M4 with #5222 - but I'm now stuck on an illegal instruction error involving cntd/cntw myself, trying to get dot_kernel_sve working in streaming mode with the __arm_streaming attribute

Apr 11 '25 15:04 martin-frbg

Is it a bug in the LLVM compiler?

Apr 14 '25 08:04 ITCJ