Use ND-range kernel in parallel_for kernel implementation to improve performance
Performance issues have been identified in our parallel_for kernel on Intel® Data Center GPU Max for input sizes that are not powers of two. The root cause is our use of a SYCL basic (range-based) parallel kernel in the implementation.
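For context, a basic SYCL kernel is launched over a plain `sycl::range` with one work-item per element, leaving the work-group decomposition entirely to the runtime. A minimal sketch of that shape (the names `q`, `data`, `n`, and `op` are illustrative, not the actual oneDPL code):

```cpp
// Basic (range-based) data-parallel kernel: one work-item per element,
// with the runtime choosing the work-group decomposition.
q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    op(data[i]);
});
```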
This PR switches to an explicit nd-range kernel, used only on GPU devices, within the parallel_for pattern. It uses a grid-strided memory access pattern, which resolves the identified performance issues; improvements have also been observed for power-of-two sizes. A sketch of the approach follows below.
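Here is a minimal, self-contained sketch of the grid-stride nd-range idea, not the actual code in this PR; the work-group size and count, function name, and parameters are illustrative placeholders:

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Grid-stride sketch: launch a fixed number of work-items and have each
// one stride over the input, so the launch geometry does not depend on n.
template <typename T, typename UnaryOp>
void
grid_stride_for_each(sycl::queue q, T* data, std::size_t n, UnaryOp op)
{
    constexpr std::size_t wg_size = 256; // illustrative work-group size
    constexpr std::size_t num_wgs = 64;  // illustrative work-group count
    constexpr std::size_t global_size = wg_size * num_wgs;

    q.parallel_for(
         sycl::nd_range<1>{sycl::range<1>{global_size}, sycl::range<1>{wg_size}},
         [=](sycl::nd_item<1> item) {
             // Each work-item processes elements i, i + global_size, i + 2*global_size, ...
             // so any input size n (power of two or not) is covered by the same launch.
             for (std::size_t i = item.get_global_id(0); i < n; i += global_size)
                 op(data[i]);
         })
        .wait();
}
```

Because the grid-stride loop decouples the launch size from `n`, non-power-of-two inputs are handled without changing the work-group configuration.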
Algorithms that internally rely on this kernel (e.g., for_each, copy) should see improvements on GPUs.
I am marking this PR as a draft for now. With recent compiler / driver updates I have seen substantial performance improvements with basic SYCL kernels in our parallel_for pattern, so this patch may no longer be needed; I will do some wider evaluation. Ideally, we keep this kernel as generic as possible to reduce the need for future tuning for new architectures / devices.
I have looked back at this PR again to see if it is something that should still be pursued since it is marked as priority.
After further thought, I think this PR can be closed. I still observe that we can improve the performance of this pattern by tuning, but I think it would ultimately end up over-tuned for certain architectures and operators / data types. From my testing, we would also have to differentiate between vendor backends to get the best performance on each vendor. This would add more complexity than it is worth.
parallel_for is a very simple pattern, so if we find poor performance with the basic kernels, I think we should push for improvements in the SYCL basic kernel implementation instead, which I have already seen happen. The performance issue that motivated this PR (massive runtime spikes for random non-power-of-two input sizes) no longer exists.
Are there any objections to me closing this PR? @akukanov @julianmi @danhoeflinger
No objections here. I think your argument is sound. It's best if we can rely on simple code for such a simple pattern. The SYCL implementation should be able to run this efficiently on any hardware (and if it doesn't, we should pursue the problem there in the long run). As long as we aren't sacrificing huge performance benefits, I'm fine with closing it.