[FEA] distributed_helper used to allow mem_order="acquire" in spin_lock_wait, but in 4.3.1 pip package only "relaxed" is exposed via
Which component requires the feature?
CuTe DSL
Feature Request
Is your feature request related to a problem? Please describe.
distributed_helper used to allow mem_order="acquire" in spin_lock_wait, but the 4.3.1 pip package exposes only the relaxed variant, via spin_lock_atom_cas_relaxed_wait. It would be good to have an acquire version exposed as well.
Describe the solution you'd like
One possibility: spin_lock_atom_cas_acquire_wait
Describe alternatives you've considered
Keep the mem_order string argument. Some of our source code may have been ported from the examples; I'm not sure and need to check.
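To make the mem_order alternative concrete, here is a rough host-side sketch of a string-keyed dispatcher. It is not the CUTLASS implementation. The two wait functions are plain Python stubs with simplified, hypothetical signatures (`lock`, `expected_val`); the real helpers are device-side and take different arguments, and spin_lock_atom_cas_acquire_wait is only the name proposed in this issue and does not exist in 4.3.1.

```python
# Hypothetical stand-ins for the device-side helpers. They only report which
# variant was selected; no atomic operation or spinning happens here.
def spin_lock_atom_cas_relaxed_wait(lock, expected_val):
    return ("relaxed", lock, expected_val)


def spin_lock_atom_cas_acquire_wait(lock, expected_val):
    # Proposed in this issue; not available in the 4.3.1 pip package.
    return ("acquire", lock, expected_val)


# Map each supported mem_order string to the matching wait variant.
_WAIT_BY_MEM_ORDER = {
    "relaxed": spin_lock_atom_cas_relaxed_wait,
    "acquire": spin_lock_atom_cas_acquire_wait,
}


def spin_lock_wait(lock, expected_val, mem_order="relaxed"):
    """Dispatch to a wait variant by memory order (sketch, assumed signature)."""
    try:
        wait_fn = _WAIT_BY_MEM_ORDER[mem_order]
    except KeyError:
        supported = ", ".join(sorted(_WAIT_BY_MEM_ORDER))
        raise ValueError(
            f"unsupported mem_order {mem_order!r}; expected one of: {supported}"
        ) from None
    return wait_fn(lock, expected_val)
```

With a wrapper along these lines, existing call sites that pass mem_order="acquire" would keep working, and unsupported values would fail with a clear error instead of silently falling back to relaxed.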
Additional context
Used by:
https://github.com/aleozlx/flashinfer/blob/442dec9bea569f53e01b799a2e0328c2ea30bbca/flashinfer/cute_dsl/gemm_allreduce_two_shot.py#L1399-L1403
https://github.com/NVIDIA/cutlass/blob/v4.3.1/python/CuTeDSL/cutlass/utils/distributed_helpers.py#L136