FidelityFX-ParallelSort Kernels read keys and values out of bounds

Hello,

I've recently discovered that if the key/value buffers used in the sort are not a multiple of PARALLELSORT_THREADGROUP_SIZE * 4, then the buffers are read out of bounds and undefined behavior can occur.

See these lines below: https://github.com/GPUOpen-Effects/FidelityFX-ParallelSort/blob/0c539948c8d196ae338d91efbc8ca495f1ea0d1d/ffx-parallelsort/FFX_ParallelSort.h#L133-L137

No bounds checks are done to SrcBuffer here, causing GPU instability when these are read out of bounds.

Also an issue here:

https://github.com/GPUOpen-Effects/FidelityFX-ParallelSort/blob/0c539948c8d196ae338d91efbc8ca495f1ea0d1d/ffx-parallelsort/FFX_ParallelSort.h#L348-L360

Later on, the number of keys is checked, but by that point it's too late: https://github.com/GPUOpen-Effects/FidelityFX-ParallelSort/blob/0c539948c8d196ae338d91efbc8ca495f1ea0d1d/ffx-parallelsort/FFX_ParallelSort.h#L369-L371

I suspect the fix would be to just check the number of keys before pre-loading the key/value pairs.

Reproducing is simple enough, just run the sort on data that is less than PARALLELSORT_THREADGROUP_SIZE * 4 with GPU-assisted validation that checks out of bounds descriptor reads.

Oct 17 '23 18:10 natevm

Thank you for reporting this. I'll file a ticket internally and we'll get it fixed in the next release.

Oct 17 '23 23:10 jlacroixAMD

As I just realized this was reported on the old Parallel Sort sample, a fix to address this will be pushed with the next version of the FidelityFX SDK (which is how we are pushing out most updates to our older features now - https://github.com/GPUOpen-LibrariesAndSDKs/FidelityFX-SDK).

Also, in order to keep the GPU code as fast as possible, the fix will likely be done as a check on the NumKeys value at CPU time with an error code returned in the data setup stage.

Oct 19 '23 02:10 jlacroixAMD

Fwiw this sort implementation has been very helpful.

Small feature request, it would be great to also have a separate dedicated prefix sum/scan and a parallel device selection / compaction. Both of these are used internally by the radix sorter, but would also be helpful standalone.

Oct 19 '23 03:10 natevm

I'll add it to the list of planned improvements to existing samples. Cheers!

Oct 19 '23 03:10 jlacroixAMD