MatX
[FEA] Add cub block radix sort when sort dimension is small
We should write a simple kernel that uses CUB to dispatch one sort batch per CTA when the sorting dimension is small. This would increase throughput and lower latency for batched sorts over small dimensions.
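A rough sketch of what such a kernel could look like, using `cub::BlockRadixSort` with one row per thread block. This is not MatX code: the names `SmallRowSortKernel` and `SortRows` and the 128 x 4 tile size are assumptions for illustration, and it only handles ascending float keys in rows of at most 512 elements.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical tuning parameters: 128 threads x 4 items covers rows of up to 512 keys.
constexpr int kThreads        = 128;
constexpr int kItemsPerThread = 4;

// One CTA sorts one row of `cols` floats in ascending order.
// Assumes cols <= kThreads * kItemsPerThread and that the data contains no NaNs.
__global__ void SmallRowSortKernel(const float* in, float* out, int cols)
{
  using BlockLoad  = cub::BlockLoad<float, kThreads, kItemsPerThread, cub::BLOCK_LOAD_TRANSPOSE>;
  using BlockStore = cub::BlockStore<float, kThreads, kItemsPerThread, cub::BLOCK_STORE_TRANSPOSE>;
  using BlockSort  = cub::BlockRadixSort<float, kThreads, kItemsPerThread>;

  // The three collectives run one after another, so they can share storage.
  __shared__ union {
    BlockLoad::TempStorage  load;
    BlockStore::TempStorage store;
    BlockSort::TempStorage  sort;
  } temp;

  const float* row_in  = in  + static_cast<size_t>(blockIdx.x) * cols;
  float*       row_out = out + static_cast<size_t>(blockIdx.x) * cols;

  float keys[kItemsPerThread];
  // Out-of-range slots are padded with +inf so they sort to the tail of the tile.
  BlockLoad(temp.load).Load(row_in, keys, cols, INFINITY);
  __syncthreads();
  BlockSort(temp.sort).Sort(keys);
  __syncthreads();
  // Guarded store writes only the first `cols` sorted keys.
  BlockStore(temp.store).Store(row_out, keys, cols);
}

// Host helper (hypothetical): sorts each row of a row-major rows x cols matrix.
std::vector<float> SortRows(const std::vector<float>& host, int rows, int cols)
{
  float* d_in  = nullptr;
  float* d_out = nullptr;
  const size_t bytes = host.size() * sizeof(float);
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

  SmallRowSortKernel<<<rows, kThreads>>>(d_in, d_out, cols);  // one CTA per batch

  std::vector<float> result(host.size());
  cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

Error checking is left out to keep it short. A real version would probably template on the key type and pick the tile size from the sort dimension.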
Switching from DeviceSegmentedRadixSort to DeviceSegmentedSort solves this.
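For reference, a minimal sketch of calling `cub::DeviceSegmentedSort::SortKeys` on a batch of equal-length rows. The helper name `SegmentedSortRows` and the row-major layout are assumptions for illustration, not how MatX actually wires it up.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: sorts each row of a row-major rows x cols matrix in ascending order.
std::vector<float> SegmentedSortRows(const std::vector<float>& host, int rows, int cols)
{
  // Segment i covers [offsets[i], offsets[i + 1]).
  std::vector<int> offsets(rows + 1);
  for (int i = 0; i <= rows; ++i) offsets[i] = i * cols;

  float* d_in      = nullptr;
  float* d_out     = nullptr;
  int*   d_offsets = nullptr;
  const size_t bytes = host.size() * sizeof(float);
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMalloc(&d_offsets, offsets.size() * sizeof(int));
  cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

  // First call only queries the temporary storage size; second call does the sort.
  void*  d_temp     = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                     rows * cols, rows, d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                     rows * cols, rows, d_offsets, d_offsets + 1);

  std::vector<float> result(host.size());
  cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_temp);
  cudaFree(d_offsets);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

The begin and end offsets are passed as `d_offsets` and `d_offsets + 1`, which is the usual way to describe contiguous segments with a single array.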
Addressed in PR #272