oneDPL
Pass value and remove barrier between transform_reduce and reduce_over_group
This PR changes how values are passed between transform_reduce and reduce_over_group. Instead of storing each value in local memory, synchronizing the work-group, and then loading it back, we keep the value in a register and return it directly.
This allows us to save local memory loads and stores, and a barrier. We have seen improved performance on both Nvidia and Intel GPUs from this PR.
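For readers unfamiliar with the pattern, here is a minimal host-side sketch of the difference. It is plain C++ that models one work-group with a loop, not the actual oneDPL or SYCL code from this PR, and every name in it (`item_partial`, `reduce_via_local_memory`, `reduce_via_value`, `check_paths_agree`) is hypothetical. It only illustrates that staging partial results in a local-memory array and passing them by value give the same result, while the by-value path has no staging buffer and no point where a barrier would be needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: the names below are hypothetical, not oneDPL internals.

// Each simulated work-item transform-reduces a strided slice into one partial value.
template <typename T, typename UnaryOp, typename BinaryOp>
T item_partial(const std::vector<T>& in, std::size_t item, std::size_t group_size,
               UnaryOp unary, BinaryOp binary, T identity)
{
    T acc = identity;
    for (std::size_t i = item; i < in.size(); i += group_size)
        acc = binary(acc, unary(in[i]));
    return acc;
}

// Old pattern: write partials to a staging array (stand-in for local memory),
// synchronize, then read them back for the group reduction.
template <typename T, typename UnaryOp, typename BinaryOp>
T reduce_via_local_memory(const std::vector<T>& in, std::size_t group_size,
                          UnaryOp unary, BinaryOp binary, T identity)
{
    std::vector<T> local_mem(group_size, identity);
    for (std::size_t item = 0; item < group_size; ++item)
        local_mem[item] = item_partial(in, item, group_size, unary, binary, identity);
    // On a device, a work-group barrier would be required at this point.
    T result = identity;
    for (std::size_t item = 0; item < group_size; ++item)
        result = binary(result, local_mem[item]);
    return result;
}

// New pattern: each partial stays in a local variable (a register on a device)
// and is handed straight to the group reduction.
template <typename T, typename UnaryOp, typename BinaryOp>
T reduce_via_value(const std::vector<T>& in, std::size_t group_size,
                   UnaryOp unary, BinaryOp binary, T identity)
{
    T result = identity;
    for (std::size_t item = 0; item < group_size; ++item)
    {
        const T partial = item_partial(in, item, group_size, unary, binary, identity);
        result = binary(result, partial);
    }
    return result;
}

// Usage: sum of squares of 1..8 with a simulated work-group of 4 items.
inline void check_paths_agree()
{
    const std::vector<int> in{1, 2, 3, 4, 5, 6, 7, 8};
    auto square = [](int x) { return x * x; };
    auto plus = [](int a, int b) { return a + b; };
    const int staged = reduce_via_local_memory(in, std::size_t{4}, square, plus, 0);
    const int direct = reduce_via_value(in, std::size_t{4}, square, plus, 0);
    assert(staged == direct);
    assert(direct == 204);
}
```

The model uses integer addition, so the order of combination does not affect the result; with floating-point types the two paths could differ by rounding if the combination order changed.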
@AidanBeltonS Please rebase this PR off main to resolve the conflicts.
I have rebased the PR.