Add spin_lock_atom_cas_acquire_wait function

Open aleozlx opened this issue 2 months ago • 1 comments

For https://github.com/NVIDIA/cutlass/issues/2845

Added spin_lock_atom_cas_acquire_wait function to handle spin lock acquisition with atomic compare-and-swap.
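
For readers unfamiliar with the pattern, here is a minimal host-side sketch of a spin lock acquired through atomic compare-and-swap with acquire ordering. It is an illustration only, written against C++ `std::atomic`, and is not the CuTeDSL implementation. The names `spin_lock_cas_acquire_wait`, `try_lock_cas_acquire`, and `spin_unlock_release`, along with their parameters, are hypothetical and chosen for this example.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not the CuTeDSL API): try once to swap the lock word
// from `expected` to `desired`. Acquire ordering on success makes writes
// published by the previous holder's release store visible to the caller.
inline bool try_lock_cas_acquire(std::atomic<int>& lock, int expected, int desired) {
    return lock.compare_exchange_strong(expected, desired,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
}

// Hypothetical helper: spin until the CAS from `expected` to `desired` succeeds.
inline void spin_lock_cas_acquire_wait(std::atomic<int>& lock, int expected, int desired) {
    int observed = expected;
    while (!lock.compare_exchange_weak(observed, desired,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
        // On failure the CAS writes the current value into `observed`,
        // so reset it before retrying.
        observed = expected;
        std::this_thread::yield();
    }
}

// Hypothetical helper: release the lock by storing the unlocked value
// with release ordering, pairing with the acquire CAS above.
inline void spin_unlock_release(std::atomic<int>& lock, int unlocked) {
    lock.store(unlocked, std::memory_order_release);
}
```

In this sketch a caller would invoke `spin_lock_cas_acquire_wait(flag, 0, 1)` before touching the shared data and `spin_unlock_release(flag, 0)` afterwards. A GPU version would use device-scope atomics instead of `std::atomic`, which this example does not attempt to show.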

Dec 05 '25 03:12 aleozlx

This is functional. https://github.com/flashinfer-ai/flashinfer/pull/2171

Raising it as a proposed solution for what we needed when upgrading to nvidia-cutlass-dsl 4.3.1 (see https://github.com/NVIDIA/cutlass/issues/2845).

Kind regards from FlashInfer & cuDNN :)

Dec 05 '25 03:12 aleozlx

The acquire wait is not needed. Message Xiao Song on Slack and we can schedule a meeting to explain this.

Dec 12 '25 07:12 XiaoSong9905

The two-shot all reduce.py failure is related to something else; let's discuss it in the meeting.

Dec 12 '25 07:12 XiaoSong9905

You can use the new two-shot GEMM+AR kernel in the CuTeDSL examples. The one in FlashInfer is likely an older version.

Adding something to the CuTeDSL wheel package will take some time, so I would recommend using the new kernel.

Dec 12 '25 07:12 shubaoyu2

Sounds good, will discuss with you over Slack. I'll look into the new kernel example and bring action items back to FI.

Dec 16 '25 02:12 aleozlx