STL icon indicating copy to clipboard operation
STL copied to clipboard

Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2

Open AlexGuteniev opened this issue 1 year ago • 9 comments

Tracking any remaining algorithms to vectorize.

Optimized via C runtime library functions, like memcpy, counts as vectorized too, as these functions are optimized.

See also #7.

Algorithm Vectorized? Further work Comment
for_each,for_each_n compiler no
all_of,any_of,none_of no no depends on predicate
contains library no
contains_subrange library no
find library no
find_if,find_if_not no no depends on predicate
find_last library no
find_last_if,find_last_if_not no no depends on predicate
find_end library yes PR in flight
find_first_of library no
adjacent_find library no
count library no manual vectorization complete, auto vectorization rejected
count_if no no auto vectorization takes too much additional code, rejected
mismatch library no
equal library no
search library yes PR in flight
search_n library no
starts_with, ends_with library no
fold family compiler no
copy, copy_n,copy_backward library no
copy_if no no depends on predicate
move,move_backward library no
swap compiler no
swap_ranges library no
iter_swap non-range no
transform compiler no
replace library no remove library vectorization if compiler implements DevCom-10610477
replace_if no no compiler may implement DevCom-10610477, otherwise nothing can be done
replace_copy,replace_copy_if compiler no remove workaround if compiler implements DevCom-10606350
fill, fill_n library no yield to compiler in the future; see #3642, DevCom-10334038, DevCom-10334032; see also #3167
generate compiler no compiler does it, and depends on predicate
remove library no
remove_if no no depends on predicate
remove_copy library no
remove_copy_if no no depends on predicate
unique library no
unique_copy library no
reverse, reverse_copy library no
rotate library maybe might be dedicated algorithm instead of reverse, but with only 2x speedup, and much complexity
rotate_copy library no
shift_left, shift_right library no
shuffle family no no it is random
sample no no it is random
is_partition,partition_point no no depends on predicate
partition family no no depends on predicate
sort family, nth_element no no too complex to implement/maintain
is_sorted_until library no
binary_search family no no vectorization isn't expected to improve a lot
includes no no complex, hard to vectorize and benchmark
set_* family no no too complex, successful vectorization unlikely
merge family no no too complex, successful vectorization unlikely
*_heap family no no too complex, successful vectorization unlikely
minmax_element family library no
minmax family library no
clamp non-range no
lexicographical_compare library no
*_permutation family no maybe can be done in library, although complex and rare enough
iota compiler maybe Vectorized for 4 and 8 bytes elements with some workaround, this can potentially be improved and extended by improving the compiler or by vectorizing manually
accumulate compiler no
inner_product compiler no
reduce,transform_reduce, compiler no
*_scan family no no see no way how to do it
adjacent_difference compiler no
partial_sum no no see no way how to do it

AlexGuteniev avatar Feb 20 '24 13:02 AlexGuteniev

I mentioned this on Discord, but it seems some of it didn't make it into edits, so I'll just ~~take credit~~ repost it.

  • Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.
  • For int8 and int16 (and less than 16 bytes to search for), find_first_of looks like an almost perfect match for one of the SSE4.2 string compare instructions. (Not entirely sure which one, though; it's a complicated instruction, and the documentation is written more from a hardware/implementation perspective than user perspective.)
  • is_sorted sounds easy. Just load pointer and pointer+sizeof(int32), compare less-or-equal, and check if all true. Sure, it's unaligned, but it's faster than scalar. (For unsigned compare, xor both sides with 0x80000000 then do signed compare.) (Alternatively, you could perform aligned loads then use SSSE3 _mm_alignr_epi8 to get the bytes where you want them. AVX2 _mm256_alignr_epi8 is, as far as I can determine, useless - it can't transfer bytes across 128bit lanes.)
  • remove looks like a good match for AVX512_VBMI2 vpcompress*, but AVX512 is currently rare, so I doubt that effort is justified. At least not yet, it can be revisited a few years.

Fair chance some of the above aren't worth the effort, but I don't have the numbers or intuition to guess which. Nor do I have any experience or knowledge regarding ARM vectorization.

Alcaro avatar Feb 20 '24 17:02 Alcaro

This sounds like a reasonable analysis to us. We would want to consider PRs for this, with benchmarks, for individual algorithms at a time (not all at once please!).

#813 tracks ARM64 vectorization and is orthogonal to this issue.

StephanTLavavej avatar Feb 21 '24 22:02 StephanTLavavej

  • but AVX512 is currently rare, so I doubt that effort is justified.

According to the Steam Hardware & Software Survey: January 2024, AVX512 has only ~11% incidence, so it's definitely not worth it. Even the relatively old AVX2 has only 92%. image

jovibor avatar Feb 22 '24 09:02 jovibor

Speeding things up for 92% of people is definitely worth it. Needs a cpuid check so the last 8% don't crash, but they can just fall back to the scalar version.

But yes, 11% is significantly less. And that's 11% of gamers, who tend to have better processors than average. And that's for the third level of AVX512; std::remove needs VBMI2, which is the fifth level.

image

Though judging by the difference between first, 11.26%, and third, 11.14%, fair chance fifth is also about 11%.

But while 11% (or whatever the real number is) is a small fraction, it's 11% of Windows installations, which is still many millions.

That's why that 'doubt' is there. I don't know how Microsoft's priorities look at that scale.

(Definitely should take the AVX2-capable ones first, though.)

Alcaro avatar Feb 22 '24 10:02 Alcaro

We're still talking about very small fraction of users, because not many programs would spend in these algorithms significant amount of the time (if any time at all). I was pleased to know about #3617 in the sense there's at least one attempt to use the improved algorithm on a large amount of data.

lexicographical_compare / mismatch and improving minmax look like the priority directions, many programs do compare strings and arrays, and minmax is apparently used.

AlexGuteniev avatar Feb 22 '24 11:02 AlexGuteniev

  • Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.

I think I see how to make it generally efficient:

  • 1-element needle is of course just find
  • For long enough needle we can go by individual elements in haystack, and do vpcmpeqb+vptest loop on the needle, using already explored AVX2 mask technique with its last part.
  • Finally, for the lengths in the middle, which fits AVX2 reg, we can go by __m256i vectors in haystack, and put a needle in an __m256i variable. Compare them, save the mask, then shuffle the needle __m256i, and compare again. Repeat until every needle element is on every position. Then vpor the resulting masks. Then do vpmovmskb and if mask is nonzero, get the result from it.

AlexGuteniev avatar Apr 05 '24 08:04 AlexGuteniev

I guess swap should be considered conditionally performed on range, see #2683.

Not sure whether such strategy is viable:

(We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the <yvals.h> machinery.)

frederick-vs-ja avatar Apr 12 '24 10:04 frederick-vs-ja

I guess swap should be considered conditionally performed on range, see #2683.

Thanks, I updated the table to mention the issue. I don't think I would attempt on this one.

AlexGuteniev avatar Apr 18 '24 19:04 AlexGuteniev

adjacent_difference

It seems too rare to justify manual vectorization. However looks like it is feasible to use "help the compiler auto-vectorize" strategy. See https://godbolt.org/z/a5EP7q4hT and DevCom-10742868

AlexGuteniev avatar Sep 11 '24 06:09 AlexGuteniev