[FEA]: Improve and clean up `ThreadLoad`

Open fbusato opened this issue 1 year ago • 0 comments

[x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

CUB

`ThreadLoad` is a fundamental utility in CUB. However, it lacks some key features, especially on recent GPU architectures, and parts of its implementation are outdated.

Feature list:

- Cache eviction policies on recent GPU architectures (SM70+) are missing
- Cache prefetch, e.g. `.L2::256B`, is not handled
- Non-trivially copyable data types are handled incorrectly
- We don't support 32-bit platforms anymore, so the following macros are no longer needed: `_CUB_ASM_PTR_`, `__CUB_LP64__`, `_CUB_ASM_PTR_SIZE_`
- Some template specializations of `ThreadLoad`, e.g. for `ushort4`, may never be used and only add compilation overhead
- Add assertions for `nullptr` accesses and validate that the pointer refers to the correct memory space, i.e. global
- Expose in the `cub::` namespace
- Add them to public documentation
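
As a rough illustration of two items above (the trivially-copyable requirement and the `nullptr` assertion), here is a minimal host-compilable sketch. `checked_load` and `Pixel` are hypothetical names used only for this example and are not part of CUB. It also assumes `T` is default-constructible, which the real utility may not require. A device-side version could additionally validate the address space, for example with the CUDA `__isGlobal` intrinsic; that check is omitted here.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical 4 x 16-bit aggregate, loosely modeled on ushort4 (illustration only).
struct Pixel
{
  unsigned short x, y, z, w;
};

// Hypothetical helper, NOT the CUB API: sketches the checks proposed in the list above.
template <typename T>
T checked_load(const T* ptr)
{
  // Compile-time guard: a bytewise copy is only valid for trivially copyable types.
  static_assert(std::is_trivially_copyable<T>::value,
                "checked_load requires a trivially copyable type");
  // Assumption of this sketch: the result object is default-constructed first.
  static_assert(std::is_default_constructible<T>::value,
                "checked_load (sketch) requires a default-constructible type");

  // Debug-time guard against null accesses (compiled out when NDEBUG is defined).
  assert(ptr != nullptr && "checked_load: null pointer");

  T value;
  std::memcpy(&value, ptr, sizeof(T)); // copy the object representation
  return value;
}
```

With this sketch, `checked_load(&pixel)` compiles for `Pixel`, while instantiating it with a type such as `std::string` fails at compile time instead of silently performing an invalid bytewise copy.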


Sep 30 '24 19:09 fbusato