cccl
[FEA]: Improve and cleanup `ThreadLoad`
Is this a duplicate?
- [x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
CUB
Is your feature request related to a problem? Please describe.
`ThreadLoad` is a fundamental utility in CUB. However, it lacks some key features, especially on recent GPU architectures.
Describe the solution you'd like
Feature list:
- Cache eviction policies on recent GPU architectures (SM70+) are missing
- Cache prefetch, e.g. `.L2::256B`, is not handled
- Non-trivially copyable data types are handled incorrectly
- We don't support 32-bit platforms anymore. The following macros are no longer needed: `_CUB_ASM_PTR_`, `__CUB_LP64__`, `_CUB_ASM_PTR_SIZE_`
- Some template specializations of `ThreadLoad` may not be used, e.g. `ushort4`, adding overhead to the compilation
- Add assertions for `nullptr` accesses and validate the correct memory space, i.e. `global`
- Expose in the `cub::` namespace
- Add them to public documentation
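
To make two of the items above more concrete (rejecting non-trivially copyable types and asserting on `nullptr`), here is a minimal host-side sketch. It is not CUB code: `checked_load` and `Pair` are hypothetical names used only for illustration, and a real `ThreadLoad` would issue a PTX load with the requested cache modifier instead of `std::memcpy`.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical helper (an assumption for illustration, not part of CUB).
// It rejects unsupported types at compile time and null pointers at run time.
template <typename T>
T checked_load(const T* ptr)
{
  // A byte-wise load is only well defined for trivially copyable types.
  static_assert(std::is_trivially_copyable<T>::value,
                "checked_load requires a trivially copyable type");
  // The local temporary below needs a default constructor.
  static_assert(std::is_default_constructible<T>::value,
                "checked_load requires a default-constructible type");
  // Debug-only check; compiled out when NDEBUG is defined.
  assert(ptr != nullptr && "checked_load called with nullptr");

  T value;
  std::memcpy(&value, ptr, sizeof(T)); // stand-in for the actual load instruction
  return value;
}

// Hypothetical example type: trivially copyable, so it is accepted.
struct Pair
{
  int a;
  float b;
};
```

A type such as `std::string` would fail the first `static_assert` rather than being copied byte by byte. The memory-space validation is left out because it cannot run on the host; in device code the CUDA address space predicate `__isGlobal(ptr)` could back a similar assertion.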
Describe alternatives you've considered
No response
Additional context
No response