Performance of inclusive_scan_by_segment() might be improved
For 250M elements, code like the following runs more than 3 times slower than a comparable sample from the "Data Parallel C++" book:
// num_elements is 250'000'000; policy is a oneDPL device execution policy (definitions not shown)
int* vals = sycl::malloc_shared<int>(num_elements, policy.queue());
int* keys = sycl::malloc_shared<int>(num_elements, policy.queue());
int* output = sycl::malloc_shared<int>(num_elements, policy.queue());
std::fill(keys, keys + num_elements, 0);   // all keys equal => one segment spanning the whole input
std::iota(vals, vals + num_elements, 1);   // values 1, 2, 3, ...
std::fill(output, output + num_elements, 0);
auto iter_res = oneapi::dpl::inclusive_scan_by_segment(policy, keys, keys + num_elements, vals, output);
The CPU is an 11th Gen Intel(R) Core(TM) i7-1185G7. The compiler is Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2022.0.0 Build 20211123.
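For context, the comparison above can be reproduced with simple wall-clock timing around the call (a minimal sketch assuming the setup shown above; the original post does not show its measurement code):

#include <chrono>
#include <cstdio>
// ... keys, vals, output, num_elements, and policy set up as in the snippet above ...
auto t0 = std::chrono::steady_clock::now();
auto res = oneapi::dpl::inclusive_scan_by_segment(policy, keys, keys + num_elements, vals, output);
auto t1 = std::chrono::steady_clock::now();
std::printf("inclusive_scan_by_segment took %.3f ms\n",
            std::chrono::duration<double, std::milli>(t1 - t0).count());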
On the CPU/OCL platform, one bottleneck is the calculation of a remainder over a power of 2 in several places in scan_impl(). The division could be replaced with an AND against a 0b11..11 mask (the power of two minus one), improving performance roughly 2x.
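As a standalone illustration of that transformation (not the actual scan_impl() code; the power-of-two divisor here is a hypothetical work-group size):

#include <cassert>
#include <cstddef>

int main() {
    const std::size_t wgroup_size = 256;                // assumed power-of-two divisor
    for (std::size_t i = 0; i < 4096; ++i) {
        std::size_t rem_div  = i % wgroup_size;         // remainder via integer division
        std::size_t rem_mask = i & (wgroup_size - 1);   // remainder via AND with a 0b11..11 mask
        assert(rem_div == rem_mask);                    // identical for any power-of-two divisor
    }
    return 0;
}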
On GPU, one reason for the performance difference is the simultaneous reading of the input data and the mask. That is, when the scan is applied to ints, the mask is the same size as the data (the mask type is unsigned int), so the amount of data that has to be moved is effectively doubled. Only a "where to switch" flag is needed per element, not a whole int. To demonstrate that mask access is the issue, one can change unsigned int to unsigned char in FlagType; in my case this gives a 40% improvement on GPU/OCL.
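To illustrate the data-movement argument (a standalone sketch of a segment-boundary mask, not the oneDPL FlagType code): with a 4-byte flag the mask is as large as the int data itself, while a 1-byte flag carries the same information at a quarter of the traffic.

#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> keys = {0, 0, 1, 1, 1, 2};
    std::vector<unsigned int>  mask32(keys.size());      // current flag type: 4 bytes per element
    std::vector<unsigned char> mask8(keys.size());       // proposed flag type: 1 byte per element
    for (std::size_t i = 0; i < keys.size(); ++i) {
        bool new_segment = (i == 0) || (keys[i] != keys[i - 1]);
        mask32[i] = new_segment ? 1u : 0u;                // both masks carry the same information
        mask8[i]  = new_segment ? 1u : 0u;
    }
    std::printf("mask bytes: %zu (unsigned int) vs %zu (unsigned char)\n",
                mask32.size() * sizeof(unsigned int),
                mask8.size() * sizeof(unsigned char));
    return 0;
}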
Thank you for the analysis. We will work to integrate those changes and see if the same improvements can be made to the other "by_segment" algorithms.