[ENHANCEMENT]: Get rid of custom atomic operations once CCCL 2.4 is ready
Is your feature request related to a problem? Please describe.
The current cuco implementations use custom atomic functions (e.g. https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh#L904-L936) because of a performance regression with cuda::atomic_ref (https://github.com/NVIDIA/cccl/issues/1008). Now that the fix has been merged into the main branch, we can remove those custom functions once rapids-cmake fetches CCCL 2.4.
Describe the solution you'd like
Replace the custom atomic operations at
https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh#L905
https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh#L947
https://github.com/NVIDIA/cuCollections/blob/1c8b92074d9a0d07ff9288626c22ab4f5fb9d6ad/include/cuco/detail/hyperloglog/hyperloglog_ref.cuh#L525
with the corresponding cuda::atomic_ref operations.
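
For illustration, here is a rough sketch of what atomic_ref-based replacements could look like. It is not the actual cuco code: the helper names (compare_and_swap, update_max, demo_kernel, run_demo) are hypothetical, and it assumes the custom functions are thin wrappers around a compare-and-swap on a slot and an atomic max on a register. Relaxed ordering and device scope are also assumptions; the real call sites may need different ones.

```cuda
#include <cuda/atomic>
#include <cuda_runtime.h>

// Hypothetical stand-in for a custom CAS wrapper (not the cuco implementation).
// Returns the value observed in the slot, mirroring the atomicCAS convention.
template <typename T>
__device__ T compare_and_swap(T* address, T expected, T desired)
{
  cuda::atomic_ref<T, cuda::thread_scope_device> ref{*address};
  // On failure, `expected` is overwritten with the current slot content;
  // on success it still holds the old value, so it is the observed value either way.
  ref.compare_exchange_strong(expected, desired, cuda::std::memory_order_relaxed);
  return expected;
}

// Hypothetical stand-in for a custom atomic max on a register.
// fetch_max is a libcu++ extension available for integral types.
__device__ void update_max(int* address, int value)
{
  cuda::atomic_ref<int, cuda::thread_scope_device> ref{*address};
  ref.fetch_max(value, cuda::std::memory_order_relaxed);
}

// Every thread tries to claim the same empty slot and updates a shared maximum.
__global__ void demo_kernel(int* slot, int* reg, int* winners)
{
  int const tid       = blockIdx.x * blockDim.x + threadIdx.x;
  constexpr int empty = -1;  // assumed sentinel for an unoccupied slot

  if (compare_and_swap(slot, empty, tid) == empty) {
    // Only the thread whose CAS succeeded gets here
    cuda::atomic_ref<int, cuda::thread_scope_device> count{*winners};
    count.fetch_add(1, cuda::std::memory_order_relaxed);
  }
  update_max(reg, tid);
}

struct demo_result {
  int slot;     // thread id that claimed the slot
  int reg;      // maximum thread id seen
  int winners;  // number of successful CAS operations
};

// Host-side driver: launches 4 blocks of 64 threads and copies the results back.
demo_result run_demo()
{
  int host[3] = {-1, 0, 0};  // slot starts empty, register and counter start at 0
  int* dev    = nullptr;
  cudaMalloc(&dev, sizeof(host));
  cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
  demo_kernel<<<4, 64>>>(dev, dev + 1, dev + 2);
  // cudaMemcpy on the default stream waits for the kernel to finish
  cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
  cudaFree(dev);
  return {host[0], host[1], host[2]};
}
```

Calling run_demo() from a host main should report exactly one successful CAS and a register value equal to the largest thread id. Before switching the real call sites, it would be worth confirming against the fetched CCCL version that the generated code matches the custom intrinsics-based versions, since that was the original reason for the workaround.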