[FEA] Add cub block radix sort when sort dimension is small

Open luitjens opened this issue 3 years ago • 1 comments

We should write a simple kernel that uses cub's block radix sort to dispatch one sort batch per CTA when the sorting dimension is small. This would increase throughput and lower latency for batched sorting on small dimensions.
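A minimal sketch of what such a kernel could look like, assuming float keys, ascending order, no NaNs in the input, and rows no longer than `BLOCK_THREADS * ITEMS_PER_THREAD`. The kernel name `SmallRowSortKernel` and the tile sizes are hypothetical and are not taken from MatX; only `cub::BlockRadixSort` is an actual CUB primitive.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical kernel: each CTA sorts one row of a (batches x row_len) matrix.
// Assumes row_len <= BLOCK_THREADS * ITEMS_PER_THREAD.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void SmallRowSortKernel(const float* in, float* out, int row_len)
{
  using BlockSort = cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD>;
  __shared__ typename BlockSort::TempStorage temp_storage;

  const size_t row_offset = static_cast<size_t>(blockIdx.x) * row_len;
  float keys[ITEMS_PER_THREAD];

  // Blocked arrangement: thread t owns elements [t*IPT, (t+1)*IPT).
  // Slots past the end of the row are padded with +inf so they sort last.
  #pragma unroll
  for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
    const int idx = threadIdx.x * ITEMS_PER_THREAD + i;
    keys[i] = (idx < row_len) ? in[row_offset + idx] : INFINITY;
  }

  // Cooperative ascending sort across the whole CTA.
  BlockSort(temp_storage).Sort(keys);

  // Write back only the valid part of the row.
  #pragma unroll
  for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
    const int idx = threadIdx.x * ITEMS_PER_THREAD + i;
    if (idx < row_len) out[row_offset + idx] = keys[i];
  }
}

int main()
{
  constexpr int kThreads = 64, kItems = 4;  // handles rows of up to 256 elements
  const int batches = 1000, row_len = 200;

  std::vector<float> h(static_cast<size_t>(batches) * row_len);
  for (size_t i = 0; i < h.size(); ++i)
    h[i] = static_cast<float>((i * 7919u) % 1013u);  // arbitrary test data

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, h.size() * sizeof(float));
  cudaMalloc(&d_out, h.size() * sizeof(float));
  cudaMemcpy(d_in, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

  // One CTA per batch: a single launch sorts every row.
  SmallRowSortKernel<kThreads, kItems><<<batches, kThreads>>>(d_in, d_out, row_len);
  cudaMemcpy(h.data(), d_out, h.size() * sizeof(float), cudaMemcpyDeviceToHost);

  // Verify that every row is non-decreasing.
  bool sorted = true;
  for (int b = 0; b < batches && sorted; ++b)
    for (int i = 1; i < row_len; ++i)
      if (h[static_cast<size_t>(b) * row_len + i - 1] > h[static_cast<size_t>(b) * row_len + i]) {
        sorted = false;
        break;
      }

  std::printf("rows sorted: %s\n", sorted ? "yes" : "no");
  cudaFree(d_in);
  cudaFree(d_out);
  return sorted ? 0 : 1;
}
```

In a real dispatch path, the tile sizes would likely be chosen from the sort dimension at runtime, with a fallback to a device-wide sort once a row no longer fits in one CTA.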

luitjens, Jul 20 '22 22:07

Switching from DeviceSegmentedRadixSort to DeviceSegmentedSort solves this.
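For reference, a hedged sketch of batched row sorting with `cub::DeviceSegmentedSort::SortKeys`, assuming a CUB release that ships `DeviceSegmentedSort`, float keys, and equal-length rows stored contiguously. The helper name `SortRows` is hypothetical and this is not the code from the MatX change.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: sorts each row of a (batches x row_len) device matrix.
// d_in and d_out are distinct device buffers of batches * row_len floats.
cudaError_t SortRows(const float* d_in, float* d_out, int batches, int row_len,
                     cudaStream_t stream = 0)
{
  // Segment i covers [offsets[i], offsets[i + 1]).
  std::vector<int> h_offsets(batches + 1);
  for (int i = 0; i <= batches; ++i) h_offsets[i] = i * row_len;

  int* d_offsets = nullptr;
  cudaError_t err = cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
  if (err != cudaSuccess) return err;
  cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int),
             cudaMemcpyHostToDevice);

  // First call with a null workspace only reports the temp storage size.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                     batches * row_len, batches,
                                     d_offsets, d_offsets + 1, stream);

  err = cudaMalloc(&d_temp, temp_bytes);
  if (err == cudaSuccess) {
    // Second call performs the sort.
    err = cub::DeviceSegmentedSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                             batches * row_len, batches,
                                             d_offsets, d_offsets + 1, stream);
    cudaStreamSynchronize(stream);  // finish before releasing the workspace
    cudaFree(d_temp);
  }
  cudaFree(d_offsets);
  return err;
}
```

Production code would typically cache the offsets and the workspace across calls rather than allocating them on every sort.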

luitjens, Jul 28 '22 20:07

Addressed in PR #272

tylera-nvidia, Sep 30 '22 00:09