Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2
Tracking any remaining algorithms to vectorize.
Algorithms optimized via C runtime library functions, like memcpy, count as vectorized too, as those functions are themselves optimized.
See also #7.
| Algorithm | Vectorized? | Further work | Comment |
|---|---|---|---|
| for_each, for_each_n | compiler | no | |
| all_of, any_of, none_of | no | no | depends on predicate |
| contains | library | no | |
| contains_subrange | library | no | |
| find | library | no | |
| find_if, find_if_not | no | no | depends on predicate |
| find_last | library | no | |
| find_last_if, find_last_if_not | no | no | depends on predicate |
| find_end | library | yes | PR in flight |
| find_first_of | library | no | |
| adjacent_find | library | no | |
| count | library | no | manual vectorization complete, auto vectorization rejected |
| count_if | no | no | auto vectorization takes too much additional code, rejected |
| mismatch | library | no | |
| equal | library | no | |
| search | library | yes | PR in flight |
| search_n | library | no | |
| starts_with, ends_with | library | no | |
| fold family | compiler | no | |
| copy, copy_n, copy_backward | library | no | |
| copy_if | no | no | depends on predicate |
| move, move_backward | library | no | |
| swap | compiler | no | |
| swap_ranges | library | no | |
| iter_swap | non-range | no | |
| transform | compiler | no | |
| replace | library | no | remove library vectorization if compiler implements DevCom-10610477 |
| replace_if | no | no | compiler may implement DevCom-10610477, otherwise nothing can be done |
| replace_copy, replace_copy_if | compiler | no | remove workaround if compiler implements DevCom-10606350 |
| fill, fill_n | library | no | yield to compiler in the future; see #3642, DevCom-10334038, DevCom-10334032; see also #3167 |
| generate | compiler | no | the compiler does it, and it depends on the predicate |
| remove | library | no | |
| remove_if | no | no | depends on predicate |
| remove_copy | library | no | |
| remove_copy_if | no | no | depends on predicate |
| unique | library | no | |
| unique_copy | library | no | |
| reverse, reverse_copy | library | no | |
| rotate | library | maybe | might get a dedicated algorithm instead of being based on reverse, but with only a 2x speedup and much complexity |
| rotate_copy | library | no | |
| shift_left, shift_right | library | no | |
| shuffle family | no | no | it is random |
| sample | no | no | it is random |
| is_partitioned, partition_point | no | no | depends on predicate |
| partition family | no | no | depends on predicate |
| sort family, nth_element | no | no | too complex to implement/maintain |
| is_sorted_until | library | no | |
| binary_search family | no | no | vectorization isn't expected to improve it much |
| includes | no | no | complex, hard to vectorize and benchmark |
| set_* family | no | no | too complex, successful vectorization unlikely |
| merge family | no | no | too complex, successful vectorization unlikely |
| *_heap family | no | no | too complex, successful vectorization unlikely |
| minmax_element family | library | no | |
| minmax family | library | no | |
| clamp | non-range | no | |
| lexicographical_compare | library | no | |
| *_permutation family | no | maybe | can be done in the library, although complex and rare enough |
| iota | compiler | maybe | vectorized for 4- and 8-byte elements with a workaround; this can potentially be improved and extended by improving the compiler or by vectorizing manually |
| accumulate | compiler | no | |
| inner_product | compiler | no | |
| reduce, transform_reduce | compiler | no | |
| *_scan family | no | no | no known way to do it |
| adjacent_difference | compiler | no | |
| partial_sum | no | no | no known way to do it |
I mentioned this on Discord, but it seems some of it didn't make it into edits, so I'll just ~~take credit~~ repost it.
- Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.
- For int8 and int16 (and fewer than 16 bytes to search for), find_first_of looks like an almost perfect match for one of the SSE4.2 string compare instructions. (Not entirely sure which one, though; it's a complicated instruction, and the documentation is written more from a hardware/implementation perspective than a user perspective.) A hedged sketch of this follows after this comment.
- is_sorted sounds easy. Just load pointer and pointer+sizeof(int32), compare less-or-equal, and check if all true. Sure, it's unaligned, but it's faster than scalar. (For unsigned compare, xor both sides with 0x80000000 then do signed compare.) (Alternatively, you could perform aligned loads then use SSSE3 _mm_alignr_epi8 to get the bytes where you want them. AVX2 _mm256_alignr_epi8 is, as far as I can determine, useless - it can't transfer bytes across 128-bit lanes.) A sketch of this also follows after this comment.
- remove looks like a good match for AVX512_VBMI2 vpcompress*, but AVX512 is currently rare, so I doubt that effort is justified. At least not yet; it can be revisited in a few years.
Fair chance some of the above aren't worth the effort, but I don't have the numbers or intuition to guess which. Nor do I have any experience or knowledge regarding ARM vectorization.
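To make the SSE4.2 suggestion concrete, here is a minimal, hypothetical sketch (not the STL's implementation) of find_first_of for 8-bit elements with a needle of at most 16 bytes, using _mm_cmpestri in "equal any" mode; the function name and the copying of the needle into a padded local buffer are assumptions made for the example.

```cpp
#include <nmmintrin.h> // SSE4.2
#include <cstring>

const char* find_first_of_sse42(const char* first, const char* last,
                                const char* s_first, const char* s_last) {
    const int needle_len = static_cast<int>(s_last - s_first); // assumed to be 1..16
    char padded[16]      = {};
    std::memcpy(padded, s_first, static_cast<std::size_t>(needle_len)); // avoid reading past the needle
    const __m128i needle = _mm_loadu_si128(reinterpret_cast<const __m128i*>(padded));

    for (; last - first >= 16; first += 16) {
        const __m128i hay = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
        // "Equal any" mode: index of the first haystack byte equal to any needle byte, or 16 if none.
        const int idx = _mm_cmpestri(needle, needle_len, hay, 16,
            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);
        if (idx != 16) {
            return first + idx;
        }
    }
    for (; first != last; ++first) { // scalar tail
        for (int i = 0; i < needle_len; ++i) {
            if (*first == s_first[i]) {
                return first;
            }
        }
    }
    return last;
}
```

And a similarly hedged sketch of the is_sorted idea for int32 elements, assuming SSE2 and a contiguous range; for unsigned elements, xor both loaded vectors with 0x80000000 before the signed compare, as described above.

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>

bool is_sorted_i32(const std::int32_t* first, const std::int32_t* last) {
    // Check a[i] <= a[i + 1] four pairs at a time via two overlapping unaligned loads.
    while (last - first >= 5) {
        const __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first));
        const __m128i hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(first + 1));
        if (_mm_movemask_epi8(_mm_cmpgt_epi32(lo, hi)) != 0) {
            return false; // some a[i] > a[i + 1]
        }
        first += 4;
    }
    for (; last - first >= 2; ++first) { // scalar tail
        if (first[0] > first[1]) {
            return false;
        }
    }
    return true;
}
```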
This sounds like a reasonable analysis to us. We would want to consider PRs for this, with benchmarks, for individual algorithms one at a time (not all at once, please!).
#813 tracks ARM64 vectorization and is orthogonal to this issue.
> but AVX512 is currently rare, so I doubt that effort is justified.
According to the Steam Hardware & Software Survey: January 2024, AVX512 has only ~11% incidence, so it's definitely not worth it. Even the relatively old AVX2 is at only 92%.
Speeding things up for 92% of people is definitely worth it. Needs a cpuid check so the last 8% don't crash, but they can just fall back to the scalar version.
But yes, 11% is significantly less. And that's 11% of gamers, who tend to have better processors than average. And that's for the third level of AVX512; std::remove needs VBMI2, which is the fifth level.
Though judging by the difference between the first level (11.26%) and the third (11.14%), there's a fair chance the fifth is also about 11%.
But while 11% (or whatever the real number is) is a small fraction, it's 11% of Windows installations, which is still many millions.
That's why that 'doubt' is there. I don't know how Microsoft's priorities look at that scale.
(Definitely should take the AVX2-capable ones first, though.)
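For reference, the cpuid check mentioned above could look roughly like the following. This is a hedged illustration using documented MSVC intrinsics, not how the STL actually dispatches; besides the CPU's AVX2 bit, the OS must also report (via XGETBV) that it saves the YMM registers, or AVX2 instructions will fault.

```cpp
#include <intrin.h>

// Hypothetical helper: true if both the CPU and the OS support AVX2.
bool avx2_usable() {
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 7) {
        return false; // CPUID leaf 7, which reports AVX2, is not available
    }
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0; // OS uses XSAVE/XRSTOR
    const bool avx     = (regs[2] & (1 << 28)) != 0; // CPU supports AVX
    if (!osxsave || !avx) {
        return false;
    }
    // XCR0 bits 1 and 2: the OS saves/restores XMM and YMM state on context switches.
    if ((_xgetbv(0) & 0x6) != 0x6) {
        return false;
    }
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0; // CPUID.(EAX=7, ECX=0):EBX bit 5 is AVX2
}

// Usage idea: select the implementation once, everyone else falls back to scalar.
// auto* impl = avx2_usable() ? remove_vectorized : remove_scalar;
```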
We're still talking about a very small fraction of users, because not many programs spend a significant amount of time (if any time at all) in these algorithms. I was pleased to learn about #3617, in the sense that there's at least one attempt to use an improved algorithm on a large amount of data.
lexicographical_compare / mismatch and improving minmax look like the priority directions: many programs do compare strings and arrays, and minmax is apparently used.
> Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.
I think I see how to make it generally efficient:
- A 1-element needle is of course just `find`.
- For a long enough needle we can go by individual elements in the haystack, and do a `vpcmpeqb`+`vptest` loop over the needle, using the already explored AVX2 mask technique for its last part.
- Finally, for lengths in the middle, which fit an AVX2 register, we can go by `__m256i` vectors in the haystack, and put the needle in an `__m256i` variable. Compare them, save the mask, then shuffle the needle `__m256i`, and compare again. Repeat until every needle element has been at every position. Then `vpor` the resulting masks. Then do `vpmovmskb`, and if the mask is nonzero, get the result from it. (A sketch of this middle case follows below.)
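A minimal sketch of that middle case, written for 32-bit elements so that the lane-crossing needle rotation is a single _mm256_permutevar8x32_epi32 (byte or word elements would need a more involved rotation). The function name and the convention that the needle is padded out to all 8 lanes (with a value that already occurs in it) are assumptions for the example, and the scalar tail is omitted.

```cpp
#include <immintrin.h> // AVX2
#include <cstdint>

const std::uint32_t* find_first_of_avx2_u32(
    const std::uint32_t* first, const std::uint32_t* last, __m256i needle) {
    // Index pattern that rotates the eight 32-bit lanes by one position.
    const __m256i rotate = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);
    for (; last - first >= 8; first += 8) {
        const __m256i hay = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first));
        __m256i ndl  = needle;
        __m256i mask = _mm256_setzero_si256();
        // Compare the haystack against every rotation of the needle, so that each
        // needle element meets each haystack position exactly once; OR the masks.
        for (int i = 0; i < 8; ++i) {
            mask = _mm256_or_si256(mask, _mm256_cmpeq_epi32(hay, ndl));
            ndl  = _mm256_permutevar8x32_epi32(ndl, rotate);
        }
        const unsigned bits = static_cast<unsigned>(_mm256_movemask_ps(_mm256_castsi256_ps(mask)));
        if (bits != 0) {
            return first + _tzcnt_u32(bits); // position of the first matching lane
        }
    }
    // A scalar tail for the remaining (< 8) elements would go here.
    return last;
}
```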
I guess `swap` should be considered conditionally performed on a range, see #2683.
Not sure whether such a strategy is viable:
> (We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the `<yvals.h>` machinery.)
> I guess `swap` should be considered conditionally performed on a range, see #2683.
Thanks, I updated the table to mention the issue. I don't think I would attempt this one.
> adjacent_difference
It seems too rare to justify manual vectorization. However, it looks feasible to use the "help the compiler auto-vectorize" strategy. See https://godbolt.org/z/a5EP7q4hT and DevCom-10742868.
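For illustration, a minimal sketch of what "help the compiler auto-vectorize" could mean here, assuming a contiguous, non-overlapping source and destination of a trivial type (the function name is made up for the example): the plain pointer loop below reads only from the source, so the auto-vectorizer is free to use SIMD loads, subtracts, and stores, unlike the generic iterator-based implementation that carries a "previous value" temporary to permit in-place operation.

```cpp
#include <cstddef>

void adjacent_difference_contiguous(const int* first, const int* last, int* dest) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n == 0) {
        return;
    }
    dest[0] = first[0];
    // Each iteration depends only on the source array, so the compiler can vectorize it.
    for (std::size_t i = 1; i < n; ++i) {
        dest[i] = first[i] - first[i - 1];
    }
}
```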