oneDPL Tune vectorized binary_search, lower_bound, and upper

This PR introduces some additional tuning for the vectorized search algorithms (binary_search, lower_bound, and upper_bound) where a series of searches across a provided set of search keys are performed on a single input buffer.

The improvements are explained below:

In the current implementations, __pstl_lower_bound / __pstl_upper_bound have been passed a 64-bit size type to compute indicies in its binary search. For inputs with a large number of search keys, this is a bottleneck in performance and rooflines have shown that this kernel is bound by 64-bit arithmetic. This PR introduces a 2-kernel solution where a 32-bit unsigned type is used for index calculation if there are less than 2^32 search keys with a fallback to 64-bit offset calculation otherwise. This results in a performance improvement across these algorithms of ~1.3x on GPU Series Max 1550 for large inputs.
The second improvement is in the implementation of __pstl_lower_bound. Simplifying the loop body by checking descending powers-of-2 as opposed to continuously dividing the input size by 2 reduces the amount of total arithmetic in the compiled code and results in a performance improvement of 10-15% for large inputs on top of the changes in 1. This improvement requires the size type to be unsigned and is implemented as a new utility function: __shars_lower_bound.

Mar 13 '24 21:03 mmichel11

clang-format is failing in some of the reshuffled code:

-    sizeof(_Dst) == sizeof(_Src) && ::std::is_trivially_copyable_v<_Dst> && ::std::is_trivially_copyable_v<_Src>, _Dst>
+    sizeof(_Dst) == sizeof(_Src) && ::std::is_trivially_copyable_v<_Dst>&& ::std::is_trivially_copyable_v<_Src>, _Dst>

and is something that I think should not be changed.

Mar 14 '24 14:03 mmichel11

@mmichel11, let's merge this PR only after #1239, ok?

Sounds good. I will not merge it until after the tag dispatching PR and rebase.

Mar 19 '24 15:03 mmichel11

There might be potential benefits of utilizing 32-bit addressing on other algorithms as well.

I agree with this, but I also feel like this may be something to take up with the sycl runtime folks. Its not a good user experience to need to have this sort of logic to achieve good performance (and double the number of compiled kernels). Of course, we do not want to have the runtime automatically double the number of kernels for us, but they may be able to improve the performance of the 64-bit version somehow.

Another option @mmichel11 and I were brainstorming offline:

Adding some opt-in define which would use 32-bit addressing everywhere, where the user guarantees they wont be calling the algorithms for n > max(uint32_t), and in return gets a performance boost across the board.

Mar 22 '24 14:03 danhoeflinger

There might be potential benefits of utilizing 32-bit addressing on other algorithms as well.

I agree with this, but I also feel like this may be something to take up with the sycl runtime folks. Its not a good user experience to need to have this sort of logic to achieve good performance (and double the number of compiled kernels). Of course, we do not want to have the runtime automatically double the number of kernels for us, but they may be able to improve the performance of the 64-bit version somehow.

Another option @mmichel11 and I were brainstorming offline:

Adding some opt-in define which would use 32-bit addressing everywhere, where the user guarantees they wont be calling the algorithms for n > max(uint32_t), and in return gets a performance boost across the board.

I agree with this and especially the opt-in define. Many of our size-related types are derived from SYCL's size() or previously get_count() which seem to use 64-bits. We could cast them down if the user permits. We should probably investigate the performance benefits in more detail.

Mar 22 '24 14:03 julianmi

There might be potential benefits of utilizing 32-bit addressing on other algorithms as well.

I agree with this, but I also feel like this may be something to take up with the sycl runtime folks. Its not a good user experience to need to have this sort of logic to achieve good performance (and double the number of compiled kernels). Of course, we do not want to have the runtime automatically double the number of kernels for us, but they may be able to improve the performance of the 64-bit version somehow. Another option @mmichel11 and I were brainstorming offline: Adding some opt-in define which would use 32-bit addressing everywhere, where the user guarantees they wont be calling the algorithms for n > max(uint32_t), and in return gets a performance boost across the board.

I agree with this and especially the opt-in define. Many of our size-related types are derived from SYCL's size() or previously get_count() which seem to use 64-bits. We could cast them down if the user permits. We should probably investigate the performance benefits in more detail.

This is an investigation I plan to do in the future. There are several places now where we define separate kernels to take advantage of 32-bit index types when possible. I think an opt-in define would be the best solution instead of doubling the number of kernels per algorithm.

Edit: I have opened an issue to track this in the future: https://github.com/oneapi-src/oneDPL/issues/1467.

Mar 25 '24 20:03 mmichel11

After offline discussion with @danhoeflinger, we agreed it would be best to group the upper / lower bound implementations together. I have moved them after the bit algorithms as they are required to implement __shars_lower_bound.

Apr 04 '24 18:04 mmichel11

@julianmi @SergeyKopienko @akukanov @danhoeflinger @adamfidel

Are there any additional comments / concerns you have for this PR, or are you ready to approve?

Apr 22 '24 13:04 mmichel11

@mmichel11 no any comments or concerns from my side. I am ready to approve this PR because I can't find something incorrect anymore. But please receive additional approve from somebody too,

Apr 23 '24 09:04 SergeyKopienko

Tune vectorized binary_search, lower_bound, and upper_bound algorithms