Improve autovectorization of to_lowercase / to_uppercase functions
Refactor the code in the convert_while_ascii helper function to make it more suitable for auto-vectorization and also process the full ascii prefix of the string. The generic case conversion logic will only be invoked starting from the first non-ascii character.
The runtime on a microbenchmark with a small ascii-only input decreases from ~55ns to ~18ns per iteration. The new implementation also reduces the amount of unsafe code and encapsulates all unsafe inside the helper function.
Fixes #123712
r? @cuviper
rustbot has assigned @cuviper. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.
Use r? to explicitly pick a reviewer
r? @the8472
The assembly for x86 and aarch64 can also be seen at https://rust.godbolt.org/z/x6T65nE8E
Hi @jhorstmann do you have any benchmark for this? Thx!
@Marcondiro only the microbenchmark included in this PR. On my machine (Intel i9-11900KB) the performance increases by nearly 3x. This is without any target-specific compiler flags, rerunning them now with:
./x.py bench library/alloc/ --stage 0 --test-args to_lowercase
Before
benchmarks:
string::bench_to_lowercase 57.00ns/iter +/- 1.00ns
After
benchmarks:
string::bench_to_lowercase 20.00ns/iter +/- 1.00ns
Cool!
I tried on an M1 and the results are not as good
Before:
string::bench_to_lowercase 102.00ns/iter +/- 1.00ns
After:
string::bench_to_lowercase 99.00ns/iter +/- 1.00ns
Maybe autovec is not as powerful here idk
Thanks for running the benchmarks, glad that there is no regression on arm. The improvement on x86 mostly comes from the usage of the pmovmskb instruction, the equivalent on arm requires multiple instructions with higher latency/lower throughput.
Sorry, I was testing your solution against my fork that I am working on instead of upstream 🙈 The actual result is Before:
string::bench_to_lowercase 45.00ns/iter +/- 0.00ns
After:
string::bench_to_lowercase 99.00ns/iter +/- 1.00ns
Thanks again, I was able to reproduce this on an aws c7g instance. It seems there were some bounds check remaining in the generated assembly, which were removed by the compiler on x86_64 and also in the simplified version I checked in compiler explorer. Adding some assume fixes that and the benchmarks on c7g / Neoverse-V1 are now
Before
benchmarks:
string::bench_to_lowercase 61.00ns/iter +/- 0.00ns
After
benchmarks:
string::bench_to_lowercase 28.00ns/iter +/- 0.00ns
:umbrella: The latest upstream changes (presumably #124773) made this pull request unmergeable. Please resolve the merge conflicts.
@the8472 can you take a look at this PR? It might have been lost in the review queue.
A couple of thoughts:
- A similar thing is done also here: https://github.com/rust-lang/rust/blob/abb95639ef2b837dbfe7b5d18f51fadda29711cb/library/core/src/slice/ascii.rs#L328-L341 maybe would be nice to somehow merge the two?
- Would be interesting to understand why the compiler doesn't use
pmovmskbif not writing the code in this specific way (eg. using& 0x8080808080808080)
Updated benchmark results:
master (#eda9d7f987de76b9d61c633a6ac328936e1b94f0)
benchmarks:
str::to_lowercase::long_lorem_ipsum 534.48/iter +/- 33.86
str::to_lowercase::short_ascii 20.11/iter +/- 0.53
str::to_lowercase::short_mixed 240.21/iter +/- 3.40
str::to_lowercase::short_pile_of_poo 262.54/iter +/- 8.33
PR (#b03d93962816fd82afb619e0cf2083dc67e218e8)
benchmarks:
str::to_lowercase::long_lorem_ipsum 148.49/iter +/- 1.24
str::to_lowercase::short_ascii 12.14/iter +/- 0.13
str::to_lowercase::short_mixed 240.31/iter +/- 3.51
str::to_lowercase::short_pile_of_poo 259.42/iter +/- 7.45
Benchmark results in the initial PR comments were using a different input, these are using the same input strings as several existing benchmarks. I did not get around yet to rerunning on aarch64.
Update: Above results were with -Ctarget-cpu=native, with default target the improvements are bigger:
master
str::to_lowercase::long_lorem_ipsum 1027.63/iter +/- 39.76
str::to_lowercase::short_ascii 34.39/iter +/- 0.73
str::to_lowercase::short_mixed 262.95/iter +/- 16.07
str::to_lowercase::short_pile_of_poo 261.71/iter +/- 15.26
PR
str::to_lowercase::long_lorem_ipsum 175.39/iter +/- 3.18
str::to_lowercase::short_ascii 12.25/iter +/- 0.30
str::to_lowercase::short_mixed 237.15/iter +/- 11.36
str::to_lowercase::short_pile_of_poo 262.57/iter +/- 8.37
Benchmark results on aarch64 (Apple M1)
master:
str::to_lowercase::long_lorem_ipsum 523.29/iter +/- 3.10
str::to_lowercase::short_ascii 33.31/iter +/- 0.29
str::to_lowercase::short_mixed 299.64/iter +/- 23.58
str::to_lowercase::short_pile_of_poo 295.03/iter +/- 9.97
PR:
str::to_lowercase::long_lorem_ipsum 129.03/iter +/- 2.37
str::to_lowercase::short_ascii 23.27/iter +/- 0.49
str::to_lowercase::short_mixed 271.17/iter +/- 31.33
str::to_lowercase::short_pile_of_poo 272.03/iter +/- 3.60
Great results even without pmovmskb 👍
@bors r+
:pushpin: Commit b5ea85f6949d04e9b47f075719a76c55ab9584b9 has been approved by the8472
It is now in the queue for this repository.
:hourglass: Testing commit b5ea85f6949d04e9b47f075719a76c55ab9584b9 with merge 98a3a2c7b688228ce7968f6c7d72b9c50724844a...
The job i686-gnu failed! Check out the build log: (web) (plain)
Click to see the possible cause of the failure (guessed by this bot)
failures:
---- [codegen] tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs stdout ----
error: verification with 'FileCheck' failed
status: exit status: 1
command: "/checkout/obj/build/i686-unknown-linux-gnu/llvm/build/bin/FileCheck" "--input-file" "/checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll" "/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs" "--check-prefix=CHECK" "--check-prefix" "NONMSVC" "--allow-unused-prefixes" "--dump-input-context" "100"
--- stderr -------------------------------
/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs:8:11: error: CHECK: expected string not found in input
/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs:8:11: error: CHECK: expected string not found in input
// CHECK: [[A:%[0-9]]] = load <16 x i8>
^
/checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:10:38: note: scanning from here
define noundef i32 @lower_while_ascii(ptr noalias nocapture noundef nonnull readonly align 1 %0, i32 noundef %1, ptr noalias nocapture noundef nonnull writeonly align 1 %2, i32 noundef %3) unnamed_addr #0 personality ptr @rust_eh_personality {
^
/checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:33:2: note: possible intended match here
%5 = load <8 x i8>, ptr %input.sroa.0.021, align 1
Input file: /checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll
Check file: /checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs
-dump-input=help explains the following input dump.
Input was:
<<<<<<
<<<<<<
1: ; ModuleID = 'issue_123712_str_to_lower_autovectorization.d6ee0dc286ccc26c-cgu.0'
2: source_filename = "issue_123712_str_to_lower_autovectorization.d6ee0dc286ccc26c-cgu.0"
3: target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-i128:128-f64:32:64-f80:32-n8:16:32-S128"
4: target triple = "i686-unknown-linux-gnu"
5:
6: @alloc_9aa1bc107ccd5c4e0fb094a54d1f6c40 = private unnamed_addr constant <{ [77 x i8] }> <{ [77 x i8] c"/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs" }>, align 1
7: @alloc_e04a128e070b1881e32fb6e8157ed5f8 = private unnamed_addr constant <{ ptr, [12 x i8] }> <{ ptr @alloc_9aa1bc107ccd5c4e0fb094a54d1f6c40, [12 x i8] c"M\00\00\00\12\00\00\00\19\00\00\00" }>, align 4
9: ; Function Attrs: uwtable
9: ; Function Attrs: uwtable
10: define noundef i32 @lower_while_ascii(ptr noalias nocapture noundef nonnull readonly align 1 %0, i32 noundef %1, ptr noalias nocapture noundef nonnull writeonly align 1 %2, i32 noundef %3) unnamed_addr #0 personality ptr @rust_eh_personality {
check:8'0 X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
check:8'0 ~~~~~~~
check:8'0 ~~~~~~~
12: %_8.i = icmp ugt i32 %1, %3
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13: br i1 %_8.i, label %bb1.i, label %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit"
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
15: bb1.i: ; preds = %start
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~
16: ; call core::slice::index::slice_end_index_len_fail
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17: tail call void @_ZN4core5slice5index24slice_end_index_len_fail17hac67c3d99c158e18E(i32 noundef %1, i32 noundef %3, ptr noalias noundef nonnull readonly align 4 dereferenceable(16) @alloc_e04a128e070b1881e32fb6e8157ed5f8) #3, !noalias !2
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18: unreachable
check:8'0 ~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
20: "_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit": ; preds = %start
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21: %_917 = icmp ugt i32 %1, 7
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22: br i1 %_917, label %bb3.preheader.preheader, label %bb12
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
24: bb3.preheader.preheader: ; preds = %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit"
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
25: %4 = and i32 %1, -8
check:8'0 ~~~~~~~~~~~~~~~~~~~~~
26: br label %bb3.preheader
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
28: bb3.preheader: ; preds = %bb3.preheader.preheader, %bb18.preheader
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
29: %input.sroa.0.021 = phi ptr [ %_63, %bb18.preheader ], [ %0, %bb3.preheader.preheader ]
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
30: %input.sroa.6.020 = phi i32 [ %_62, %bb18.preheader ], [ %1, %bb3.preheader.preheader ]
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
31: %output.sroa.0.019 = phi ptr [ %_76, %bb18.preheader ], [ %2, %bb3.preheader.preheader ]
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
32: %ascii_prefix_len.sroa.0.018 = phi i32 [ %12, %bb18.preheader ], [ 0, %bb3.preheader.preheader ]
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
33: %5 = load <8 x i8>, ptr %input.sroa.0.021, align 1
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'1 ? possible intended match
34: %6 = icmp slt <8 x i8> %5, zeroinitializer
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
35: %7 = bitcast <8 x i1> %6 to i8
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
36: %_20.not = icmp eq i8 %7, 0
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
37: br i1 %_20.not, label %bb18.preheader, label %bb12
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
39: bb12: ; preds = %bb18.preheader, %bb3.preheader, %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit"
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
40: %ascii_prefix_len.sroa.0.0.lcssa = phi i32 [ 0, %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit" ], [ %ascii_prefix_len.sroa.0.018, %bb3.preheader ], [ %4, %bb18.preheader ]
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
41: ret i32 %ascii_prefix_len.sroa.0.0.lcssa
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
43: bb18.preheader: ; preds = %bb3.preheader
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
44: %8 = add <8 x i8> %5, <i8 -65, i8 -65, i8 -65, i8 -65, i8 -65, i8 -65, i8 -65, i8 -65>
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
45: %9 = icmp ult <8 x i8> %8, <i8 26, i8 26, i8 26, i8 26, i8 26, i8 26, i8 26, i8 26>
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
46: %10 = select <8 x i1> %9, <8 x i8> <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>, <8 x i8> zeroinitializer
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
47: %11 = or <8 x i8> %10, %5
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
48: store <8 x i8> %11, ptr %output.sroa.0.019, align 1
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
49: %12 = add i32 %ascii_prefix_len.sroa.0.018, 8
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
50: %_62 = add i32 %input.sroa.6.020, -8
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
51: %_63 = getelementptr inbounds i8, ptr %input.sroa.0.021, i32 8
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
52: %_76 = getelementptr inbounds i8, ptr %output.sroa.0.019, i32 8
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
53: %_9 = icmp ugt i32 %_62, 7
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
54: br i1 %_9, label %bb3.preheader, label %bb12
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
55: }
check:8'0 ~~
check:8'0 ~
57: ; core::slice::index::slice_end_index_len_fail
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
58: ; Function Attrs: cold noinline noreturn uwtable
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
59: declare void @_ZN4core5slice5index24slice_end_index_len_fail17hac67c3d99c158e18E(i32 noundef, i32 noundef, ptr noalias noundef readonly align 4 dereferenceable(16)) unnamed_addr #1
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
61: ; Function Attrs: nounwind uwtable
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
62: declare noundef i32 @rust_eh_personality(i32 noundef, i32 noundef, i64 noundef, ptr noundef, ptr noundef) unnamed_addr #2
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
64: attributes #0 = { uwtable "probe-stack"="inline-asm" "target-cpu"="pentium4" }
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
65: attributes #1 = { cold noinline noreturn uwtable "probe-stack"="inline-asm" "target-cpu"="pentium4" }
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
66: attributes #2 = { nounwind uwtable "probe-stack"="inline-asm" "target-cpu"="pentium4" }
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
67: attributes #3 = { noreturn }
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
69: !llvm.module.flags = !{!0}
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
70: !llvm.ident = !{!1}
check:8'0 ~~~~~~~~~~~~~~~~~~~~
check:8'0 ~
check:8'0 ~
72: !0 = !{i32 8, !"PIC Level", i32 2}
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
73: !1 = !{!"rustc version 1.81.0-nightly (98a3a2c7b 2024-06-27)"}
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
74: !2 = !{!3}
check:8'0 ~~~~~~~~~~~
75: !3 = distinct !{!3, !4, !"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E: %slice.0"}
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
76: !4 = distinct !{!4, !"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E"}
check:8'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------
:broken_heart: Test failed - checks-actions
The loop depends on the size of usize and i686-gnu is a 32-bit target. It supports sse2 though. So I'm not sure what the best fix would be, run the codegen test only for 64-bit targets, or make the code independent of the word size and always process 16 bytes.
2024-06-27T01:17:05.4455854Z /checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs:8:11: error: CHECK: expected string not found in input
2024-06-27T01:17:05.4456805Z // CHECK: [[A:%[0-9]]] = load <16 x i8>
2024-06-27T01:17:05.4457184Z ^
2024-06-27T01:17:05.4458416Z /checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:10:38: note: scanning from here
2024-06-27T01:17:05.4460401Z define noundef i32 @lower_while_ascii(ptr noalias nocapture noundef nonnull readonly align 1 %0, i32 noundef %1, ptr noalias nocapture noundef nonnull writeonly align 1 %2, i32 noundef %3) unnamed_addr #0 personality ptr @rust_eh_personality {
2024-06-27T01:17:05.4461733Z ^
2024-06-27T01:17:05.4463122Z /checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:33:2: note: possible intended match here
2024-06-27T01:17:05.4464356Z %5 = load <8 x i8>, ptr %input.sroa.0.021, align 1
Sizing it for SSE registers is probably fine. Afaik we don't support anything with 64bit-wide SIMD units and MMX is practically dead.
Thanks, just pushed a commit doing that. Let me know if I should squash or rebase.
I haven't succeeded yet in setting up cross-compiling to i686 locally, will have to rely on another CI run.
Oh, I thought you meant changing the implementation itself to always use 128bit wide chunks.
Some other ascii code paths use usize because they do align_to to process multiple u8s as usizes which also works as SWAR if SIMD units aren't available. But that doesn't apply here since we have input and output slices which are guaranteed to align the same way.
I haven't succeeded yet in setting up cross-compiling to i686 locally, will have to rely on another CI run.
you can use the docker scripts to execute CI jobs locally
Oh, I thought you meant changing the implementation itself to always use 128bit wide chunks.
I can also do that, assembly for i686 looks good also with 128bit chunks. 32-bit arm targets to not look too good either way, as far as I can tell.
About SWAR and 32bit ARM, did you also compare it against the code on master?
Comparing with target thumbv7neon-linux-androideabi in compiler explorer I'd say the PR with a fixed chunk size of 16 would be a win. Both do not seem to use neon very effectively. The amount of instructions is about the same, but master should only process 8 bytes per loop iteration. Without neon, PR+chunk size 16 generates a bit over twice the instructions than master, so instructions per byte should be similar.
Before updating the PR to a fixed chunk size of 16, should we also look at other targets, or trust llvm to generate good enough code? I'm even less familiar with riscv or other assembly. And I doubt the original code got so much scrutiny ;)
Without neon, PR+chunk size 16 generates a bit over twice the instructions than master, so instructions per byte should be similar.
That's with O3 which unrolls a lot. With Os the size is about the same. So I hope the embedded folks won't complain.
Good enough for now.
Thanks, updated now to a fixed chunk size. If the code size is an issue for some embedded platform, the chunked logic could be put behind the optimize_for_size feature.
@bors r+
:pushpin: Commit 623b9c8fb3b0704b42f66ec87c75a6d3fc15f8d6 has been approved by the8472
It is now in the queue for this repository.
@bors r- rollup=iffy failed in https://github.com/rust-lang/rust/pull/127143#issuecomment-2198436516