Add spin_lock_atom_cas_acquire_wait function
For https://github.com/NVIDIA/cutlass/issues/2845
Added spin_lock_atom_cas_acquire_wait function to handle spin lock acquisition with atomic compare-and-swap.
This is functional. https://github.com/flashinfer-ai/flashinfer/pull/2171
Raising it as a proposed solution for what we needed when upgrading to nvidia-cutlass-dsl 4.3.1 https://github.com/NVIDIA/cutlass/issues/2845
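For reviewers less familiar with the pattern, here is a minimal host-side C++ sketch of what "spin lock acquisition with atomic compare-and-swap" means. It is only an analogue built on `std::atomic`, not the CuTe DSL code in this PR. The names `spin_lock_cas_acquire_wait`, `spin_lock_release`, `run_counter_demo` and the 0/1 flag values are hypothetical and chosen for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical host-side analogue of a CAS-based spin-lock acquire.
// Spins until `flag` is atomically changed from `expected_val` to `desired_val`.
// Acquire ordering on success makes writes published by the previous
// release visible to the thread that takes the lock.
inline void spin_lock_cas_acquire_wait(std::atomic<int>& flag,
                                       int expected_val, int desired_val) {
    int observed = expected_val;
    while (!flag.compare_exchange_weak(observed, desired_val,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
        // On failure compare_exchange_weak overwrites `observed` with the
        // current value, so reset it before retrying.
        observed = expected_val;
        std::this_thread::yield();
    }
}

// Releases the lock by storing the unlocked value with release ordering.
inline void spin_lock_release(std::atomic<int>& flag, int unlocked_val) {
    flag.store(unlocked_val, std::memory_order_release);
}

// Demo: several threads increment a plain (non-atomic) counter under the lock.
inline int run_counter_demo(int num_threads, int iters_per_thread) {
    std::atomic<int> flag{0};  // 0 = unlocked, 1 = locked (assumed convention)
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters_per_thread; ++i) {
                spin_lock_cas_acquire_wait(flag, 0, 1);
                ++counter;  // protected by the lock
                spin_lock_release(flag, 0);
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

If the lock is correct, `run_counter_demo(4, 1000)` returns 4000, because every increment happens while the flag is held. A GPU version would use device-scope atomics instead of `std::atomic`, which this sketch does not attempt to model.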
Kind regards from FlashInfer & cuDNN :)
The acquire wait is not needed. Message Xiao Song on Slack and we can schedule a meeting to explain this.
The two-shot all reduce.py failure is related to something else; let's discuss this in the meeting.
You can use the new two-shot gemm+ar kernel in the CuTeDSL examples. The one in flashinfer should be an old version.
Adding something to the CuTeDSL wheel package will take some time, so I would recommend you use the new kernel.
Sounds good, will discuss with you over Slack. I will learn about the new kernel example and bring action items back to FI.