Performance bug for odd sizes in `for_each`, likely other algorithms
SYCL `parallel_for`, as well as `for_each` and likely other oneDPL algorithms, shows significantly degraded performance for odd (non-power-of-two) sizes. I first noticed this with the Black-Scholes benchmark, where I saw a ~10x slowdown with a size of 2*1000*1000*1000 compared to 2*1024*1024*1024.
Modifying our `for_each` algorithm to call SYCL `parallel_for` with an ND-range instead of a normal range resolves the issue:
```cpp
std::size_t range_size = rng::distance(local_segment);
std::size_t block_size = 32;
// Round the global size up to a whole number of work-groups.
std::size_t nblocks = (range_size + block_size - 1) / block_size;
auto event = q.parallel_for(sycl::nd_range<>(nblocks * block_size, block_size),
                            [=](auto index) {
                              auto idx = index.get_global_id(0);
                              // Skip the padding work-items past the end of the range.
                              if (idx < range_size) {
                                fn(*(first + idx));
                              }
                            });
```
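To make the padding arithmetic in the workaround above easier to check, here is a minimal host-only sketch in plain C++ (no SYCL required). The names `padded_launch`, `make_padded_launch`, and `padding` are hypothetical helpers introduced only for illustration. They are not part of SYCL, oneDPL, or our code base.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (illustration only): the launch configuration
// that the ND-range workaround would pass to sycl::nd_range<>.
struct padded_launch {
  std::size_t global_size;  // n rounded up to a multiple of block_size
  std::size_t block_size;   // work-group size
};

// Round n up to a whole number of blocks using ceiling division,
// mirroring the nblocks computation in the workaround.
inline padded_launch make_padded_launch(std::size_t n, std::size_t block_size) {
  std::size_t nblocks = (n + block_size - 1) / block_size;
  return padded_launch{nblocks * block_size, block_size};
}

// Number of extra work-items that the `idx < range_size` guard must skip.
inline std::size_t padding(std::size_t n, std::size_t block_size) {
  return make_padded_launch(n, block_size).global_size - n;
}
```

For example, a range of 1000 elements with a block size of 32 is padded to 1024 work-items, so the guard skips 24 of them. The size 2*1000*1000*1000 is already a multiple of 32, so in that case the workaround adds no padding at all.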
However, we shouldn't need this workaround: SYCL should be able to handle a `parallel_for` with a normal range without degrading performance. I am leaving the code as is (without the fix) for now. I will try to create a minimal example and report this as an issue with the SYCL compiler/runtime.
For now, we should run our benchmarks only with power-of-two sizes such as 2*1024*1024*1024. If a SYCL fix takes too long, we can investigate implementing this workaround ourselves, but it would only be a partial fix: oneDPL algorithms will continue to perform poorly unless they implement the same workaround.
I have filed an issue for this bug with DPC++: https://jira.devtools.intel.com/browse/CMPLRLLVM-48511.