STL Vectorize more algorithms for x86 / x64 using SSE4.2 and/or AVX2

Tracking any remaining algorithms to vectorize.

Optimized via C runtime library functions, like memcpy, counts as vectorized too, as these functions are optimized.

See also #7.

Algorithm	Vectorized?	Further work	Comment
`for_each`,`for_each_n`	compiler	no
`all_of`,`any_of`,`none_of`	no	no	depends on predicate
`contains`	library	no
`contains_subrange`	library	no
`find`	library	no
`find_if`,`find_if_not`	no	no	depends on predicate
`find_last`	library	no
`find_last_if`,`find_last_if_not`	no	no	depends on predicate
`find_end`	library	yes	PR in flight
`find_first_of`	library	no
`adjacent_find`	library	no
`count`	library	no	manual vectorization complete, auto vectorization rejected
`count_if`	no	no	auto vectorization takes too much additional code, rejected
`mismatch`	library	no
`equal`	library	no
`search`	library	yes	PR in flight
`search_n`	library	no
`starts_with`, `ends_with`	library	no
`fold` family	compiler	no
`copy`, `copy_n`,`copy_backward`	library	no
`copy_if`	no	no	depends on predicate
`move`,`move_backward`	library	no
`swap`	compiler	no
`swap_ranges`	library	no
`iter_swap`	non-range	no
`transform`	compiler	no
`replace`	library	no	remove library vectorization if compiler implements DevCom-10610477
`replace_if`	no	no	compiler may implement DevCom-10610477, otherwise nothing can be done
`replace_copy`,`replace_copy_if`	compiler	no	remove workaround if compiler implements DevCom-10606350
`fill`, `fill_n`	library	no	yield to compiler in the future; see #3642, DevCom-10334038, DevCom-10334032; see also #3167
`generate`	compiler	no	compiler does it, and depends on predicate
`remove`	library	no
`remove_if`	no	no	depends on predicate
`remove_copy`	library	no
`remove_copy_if`	no	no	depends on predicate
`unique`	library	no
`unique_copy`	library	no
`reverse`, `reverse_copy`	library	no
`rotate`	library	maybe	might be dedicated algorithm instead of reverse, but with only 2x speedup, and much complexity
`rotate_copy`	library	no
`shift_left`, `shift_right`	library	no
`shuffle` family	no	no	it is random
`sample`	no	no	it is random
`is_partition`,`partition_point`	no	no	depends on predicate
`partition` family	no	no	depends on predicate
`sort` family, `nth_element`	no	no	too complex to implement/maintain
`is_sorted_until`	library	no
`binary_search` family	no	no	vectorization isn't expected to improve a lot
`includes`	no	no	complex, hard to vectorize and benchmark
`set_*` family	no	no	too complex, successful vectorization unlikely
`merge` family	no	no	too complex, successful vectorization unlikely
`*_heap` family	no	no	too complex, successful vectorization unlikely
`minmax_element` family	library	no
`minmax` family	library	no
`clamp`	non-range	no
`lexicographical_compare`	library	no
`*_permutation` family	no	maybe	can be done in library, although complex and rare enough
`iota`	compiler	maybe	Vectorized for 4 and 8 bytes elements with some workaround, this can potentially be improved and extended by improving the compiler or by vectorizing manually
`accumulate`	compiler	no
`inner_product`	compiler	no
`reduce`,`transform_reduce`,	compiler	no
`*_scan` family	no	no	see no way how to do it
`adjacent_difference`	compiler	no
`partial_sum`	no	no	see no way how to do it

Feb 20 '24 13:02 AlexGuteniev

I mentioned this on Discord, but it seems some of it didn't make it into edits, so I'll just ~~take credit~~ repost it.

Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.
For int8 and int16 (and less than 16 bytes to search for), find_first_of looks like an almost perfect match for one of the SSE4.2 string compare instructions. (Not entirely sure which one, though; it's a complicated instruction, and the documentation is written more from a hardware/implementation perspective than user perspective.)
is_sorted sounds easy. Just load pointer and pointer+sizeof(int32), compare less-or-equal, and check if all true. Sure, it's unaligned, but it's faster than scalar. (For unsigned compare, xor both sides with 0x80000000 then do signed compare.) (Alternatively, you could perform aligned loads then use SSSE3 _mm_alignr_epi8 to get the bytes where you want them. AVX2 _mm256_alignr_epi8 is, as far as I can determine, useless - it can't transfer bytes across 128bit lanes.)
remove looks like a good match for AVX512_VBMI2 vpcompress*, but AVX512 is currently rare, so I doubt that effort is justified. At least not yet, it can be revisited a few years.

Fair chance some of the above aren't worth the effort, but I don't have the numbers or intuition to guess which. Nor do I have any experience or knowledge regarding ARM vectorization.

Feb 20 '24 17:02 Alcaro

This sounds like a reasonable analysis to us. We would want to consider PRs for this, with benchmarks, for individual algorithms at a time (not all at once please!).

#813 tracks ARM64 vectorization and is orthogonal to this issue.

Feb 21 '24 22:02 StephanTLavavej

but AVX512 is currently rare, so I doubt that effort is justified.

According to the Steam Hardware & Software Survey: January 2024, AVX512 has only ~11% incidence, so it's definitely not worth it. Even the relatively old AVX2 has only 92%.

Feb 22 '24 09:02 jovibor

Speeding things up for 92% of people is definitely worth it. Needs a cpuid check so the last 8% don't crash, but they can just fall back to the scalar version.

But yes, 11% is significantly less. And that's 11% of gamers, who tend to have better processors than average. And that's for the third level of AVX512; std::remove needs VBMI2, which is the fifth level.

Though judging by the difference between first, 11.26%, and third, 11.14%, fair chance fifth is also about 11%.

But while 11% (or whatever the real number is) is a small fraction, it's 11% of Windows installations, which is still many millions.

That's why that 'doubt' is there. I don't know how Microsoft's priorities look at that scale.

(Definitely should take the AVX2-capable ones first, though.)

Feb 22 '24 10:02 Alcaro

We're still talking about very small fraction of users, because not many programs would spend in these algorithms significant amount of the time (if any time at all). I was pleased to know about #3617 in the sense there's at least one attempt to use the improved algorithm on a large amount of data.

lexicographical_compare / mismatch and improving minmax look like the priority directions, many programs do compare strings and arrays, and minmax is apparently used.

Feb 22 '24 11:02 AlexGuteniev

Vectorizing find_first_of in the general case does indeed sound impossible, but specializing it for the most common sizes of s_{first,last} would be significantly more possible.

I think I see how to make it generally efficient:

1-element needle is of course just find
For long enough needle we can go by individual elements in haystack, and do vpcmpeqb+vptest loop on the needle, using already explored AVX2 mask technique with its last part.
Finally, for the lengths in the middle, which fits AVX2 reg, we can go by __m256i vectors in haystack, and put a needle in an __m256i variable. Compare them, save the mask, then shuffle the needle __m256i, and compare again. Repeat until every needle element is on every position. Then vpor the resulting masks. Then do vpmovmskb and if mask is nonzero, get the result from it.

Apr 05 '24 08:04 AlexGuteniev

I guess swap should be considered conditionally performed on range, see #2683.

Not sure whether such strategy is viable:

(We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the <yvals.h> machinery.)

Apr 12 '24 10:04 frederick-vs-ja

I guess swap should be considered conditionally performed on range, see #2683.

Thanks, I updated the table to mention the issue. I don't think I would attempt on this one.

Apr 18 '24 19:04 AlexGuteniev

adjacent_difference

It seems too rare to justify manual vectorization. However looks like it is feasible to use "help the compiler auto-vectorize" strategy. See https://godbolt.org/z/a5EP7q4hT and DevCom-10742868

Sep 11 '24 06:09 AlexGuteniev