Performance bug for odd sizes in `for_each`, likely other algorithms

Open BenBrock opened this issue 2 years ago • 1 comments

SYCL `parallel_for`, as well as `for_each` and likely other oneDPL algorithms, show significantly degraded performance for "odd" sizes (odd in the sense of awkward, non-power-of-two sizes, not odd parity). I first noticed this with the Black-Scholes benchmark, where I saw a ~10x slowdown with a size of 2*1000*1000*1000 compared with 2*1024*1024*1024.

Modifying our for_each algorithm to call SYCL parallel_for with an ND-range instead of a normal range resolves the issue.

    // Round the global size up to a multiple of the work-group size.
    std::size_t range_size = rng::distance(local_segment);
    std::size_t block_size = 32;
    std::size_t nblocks = (range_size + block_size - 1) / block_size;
    auto event = q.parallel_for(sycl::nd_range<>(nblocks*block_size, block_size),
                                [=](auto index) {
                                  auto idx = index.get_global_id(0);
                                  // Skip the padding work-items past the end.
                                  if (idx < range_size) {
                                    fn(*(first + idx));
                                  }
                                });
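
The round-up and guard arithmetic in the workaround above can be checked on the host without a SYCL device. The following is only a sketch: `num_blocks`, `padded_global_size`, and `guarded_items` are hypothetical helper names that do not appear in the code base, and the loop merely stands in for the kernel's work-items.

```cpp
#include <cstddef>

// Hypothetical helper: number of work-groups needed to cover n elements.
constexpr std::size_t num_blocks(std::size_t n, std::size_t block_size) {
  return (n + block_size - 1) / block_size;  // ceiling division
}

// Hypothetical helper: global size padded up to a multiple of block_size,
// which is what the nd_range launch requires.
constexpr std::size_t padded_global_size(std::size_t n, std::size_t block_size) {
  return num_blocks(n, block_size) * block_size;
}

// Host-side stand-in for the kernel: counts how many of the padded
// work-items pass the `idx < range_size` guard and would call fn.
constexpr std::size_t guarded_items(std::size_t n, std::size_t block_size) {
  std::size_t active = 0;
  for (std::size_t idx = 0; idx < padded_global_size(n, block_size); ++idx) {
    if (idx < n) {
      ++active;
    }
  }
  return active;
}
```

With a block size of 32, a range of 1000 elements is padded to 1024 work-items, and the guard leaves exactly 1000 of them active, so no element is skipped or touched out of bounds.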

However, we really shouldn't need this workaround: SYCL should be able to handle a `parallel_for` over a normal range without degrading performance. I am leaving the code as is (without the fix) for now. I will try to create a minimal example and report this as an issue with the SYCL compiler/runtime.

For now, we should just run our benchmarks with nice sizes such as 2*1024*1024*1024. If a SYCL fix takes too long, we can investigate implementing this workaround in our own algorithms, but that would only be a partial fix, as oneDPL algorithms will continue to perform poorly unless they implement the same workaround.
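
One way to pick a "nice" size for a benchmark run is to round the requested size down to a power of two. This is a sketch under the assumption that "nice" means a power of two, which is all the measurements above suggest; `floor_pow2` is a hypothetical helper, not something in the benchmark code.

```cpp
#include <cstddef>

// Hypothetical helper: largest power of two that is <= n (0 for n == 0).
constexpr std::size_t floor_pow2(std::size_t n) {
  if (n == 0) {
    return 0;
  }
  std::size_t p = 1;
  // Comparing against n / 2 avoids overflowing p when n is near SIZE_MAX.
  while (p <= n / 2) {
    p *= 2;
  }
  return p;
}
```

For the slow size from the report, 2*1000*1000*1000, this yields 1024*1024*1024, while 2*1024*1024*1024 is already a power of two and is returned unchanged.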

Jun 08 '23 19:06 BenBrock

Note that I filed an issue for this bug with DPC++ here: https://jira.devtools.intel.com/browse/CMPLRLLVM-48511.

Jun 08 '23 23:06 BenBrock