[FEA] Add cub block radix sort when sort dimension is small

Open • luitjens opened this issue 3 years ago • 1 comment

We should write a simple kernel that uses cub block radix sort to dispatch one sort batch per CTA when the sorting dimension is small. This would increase throughput and lower latency for batched sorting on small dimensions.
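
A minimal sketch of what such a kernel could look like, assuming float keys, ascending order, no NaN inputs, and a row length that fits in one CTA (`sort_dim <= BLOCK_THREADS * ITEMS_PER_THREAD`). The names `block_sort_rows` and `sort_rows_per_cta` and the tile sizes are hypothetical and chosen only for illustration; this is not the library's implementation.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <math_constants.h>
#include <vector>

// Hypothetical kernel: each CTA sorts one contiguous row of length sort_dim.
// Assumes sort_dim <= BLOCK_THREADS * ITEMS_PER_THREAD and no NaN keys.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void block_sort_rows(const float* in, float* out, int sort_dim)
{
    using BlockSort = cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockSort::TempStorage temp;

    const float* row_in  = in  + static_cast<size_t>(blockIdx.x) * sort_dim;
    float*       row_out = out + static_cast<size_t>(blockIdx.x) * sort_dim;

    // Blocked arrangement: thread t owns items [t * IPT, (t + 1) * IPT).
    // Out-of-range slots are padded with +inf so they sort to the end.
    float keys[ITEMS_PER_THREAD];
    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x * ITEMS_PER_THREAD + i;
        keys[i] = (idx < sort_dim) ? row_in[idx] : CUDART_INF_F;
    }

    BlockSort(temp).Sort(keys);  // ascending, result stays in blocked arrangement

    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int idx = threadIdx.x * ITEMS_PER_THREAD + i;
        if (idx < sort_dim) row_out[idx] = keys[i];
    }
}

// Hypothetical host helper: sorts `batch` rows of length `sort_dim`, one CTA per row.
// Returns an empty vector if the row does not fit in a single CTA tile.
std::vector<float> sort_rows_per_cta(const std::vector<float>& host, int batch, int sort_dim)
{
    constexpr int kThreads = 64;
    constexpr int kItems   = 4;  // tile capacity: 256 keys per CTA
    if (batch <= 0 || sort_dim <= 0 || sort_dim > kThreads * kItems) return {};

    const size_t bytes = host.size() * sizeof(float);
    float* d_in  = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

    block_sort_rows<kThreads, kItems><<<batch, kThreads>>>(d_in, d_out, sort_dim);

    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // implicit sync
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Padding the tail of the tile with +inf keeps the kernel branch-free inside the sort itself; a real dispatch would likely pick the tile size from `sort_dim` and fall back to a device-wide sort for larger rows.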

Jul 20 '22 22:07 luitjens

Switching from DeviceSegmentedRadixSort to DeviceSegmentedSort solves this.
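
For reference, a rough sketch of a batched sort through `cub::DeviceSegmentedSort::SortKeys`, assuming a CUB version that ships `DeviceSegmentedSort`, float keys, and equal-length contiguous rows. The helper name `sort_rows_segmented` is hypothetical, and error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical host helper: sorts `batch` rows of length `sort_dim` in ascending
// order, treating each row as one segment.
std::vector<float> sort_rows_segmented(const std::vector<float>& host, int batch, int sort_dim)
{
    const int num_items = batch * sort_dim;

    // Segment i covers [offsets[i], offsets[i + 1]).
    std::vector<int> offsets(batch + 1);
    for (int i = 0; i <= batch; ++i) offsets[i] = i * sort_dim;

    float* d_in      = nullptr;
    float* d_out     = nullptr;
    int*   d_offsets = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(float));
    cudaMalloc(&d_out, num_items * sizeof(float));
    cudaMalloc(&d_offsets, offsets.size() * sizeof(int));
    cudaMemcpy(d_in, host.data(), num_items * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

    // First call only queries the temporary storage size.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                       num_items, batch, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call performs the segmented sort.
    cub::DeviceSegmentedSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                       num_items, batch, d_offsets, d_offsets + 1);

    std::vector<float> result(num_items);
    cudaMemcpy(result.data(), d_out, num_items * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_temp);
    cudaFree(d_offsets);
    cudaFree(d_out);
    cudaFree(d_in);
    return result;
}
```

Passing `d_offsets` and `d_offsets + 1` as the begin and end iterators lets a single offsets array of `batch + 1` entries describe every segment.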

Jul 28 '22 20:07 luitjens

Addressed in PR #272

Sep 30 '22 00:09 tylera-nvidia