rust Improve autovectorization of to_lowercase / to

Refactor the code in the convert_while_ascii helper function to make it more suitable for auto-vectorization and also process the full ascii prefix of the string. The generic case conversion logic will only be invoked starting from the first non-ascii character.

The runtime on a microbenchmark with a small ascii-only input decreases from ~55ns to ~18ns per iteration. The new implementation also reduces the amount of unsafe code and encapsulates all unsafe inside the helper function.

Fixes #123712

Apr 11 '24 08:04 jhorstmann

r? @cuviper

rustbot has assigned @cuviper. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Apr 11 '24 08:04 rustbot

r? @the8472

The assembly for x86 and aarch64 can also be seen at https://rust.godbolt.org/z/x6T65nE8E

Apr 11 '24 08:04 jhorstmann

Hi @jhorstmann do you have any benchmark for this? Thx!

May 05 '24 21:05 Marcondiro

@Marcondiro only the microbenchmark included in this PR. On my machine (Intel i9-11900KB) the performance increases by nearly 3x. This is without any target-specific compiler flags, rerunning them now with:

./x.py bench library/alloc/ --stage 0 --test-args to_lowercase

Before

benchmarks:
    string::bench_to_lowercase 57.00ns/iter +/- 1.00ns

After

benchmarks:
    string::bench_to_lowercase 20.00ns/iter +/- 1.00ns

May 06 '24 07:05 jhorstmann

Cool! I tried on an M1 and the results are not as good ~~Before:~~

string::bench_to_lowercase 102.00ns/iter +/- 1.00ns

After:

string::bench_to_lowercase 99.00ns/iter +/- 1.00ns

Maybe autovec is not as powerful here idk

May 07 '24 10:05 Marcondiro

Thanks for running the benchmarks, glad that there is no regression on arm. The improvement on x86 mostly comes from the usage of the pmovmskb instruction, the equivalent on arm requires multiple instructions with higher latency/lower throughput.

May 08 '24 07:05 jhorstmann

Sorry, I was testing your solution against my fork that I am working on instead of upstream 🙈 The actual result is Before:

string::bench_to_lowercase 45.00ns/iter +/- 0.00ns

After:

string::bench_to_lowercase 99.00ns/iter +/- 1.00ns

May 08 '24 22:05 Marcondiro

Thanks again, I was able to reproduce this on an aws c7g instance. It seems there were some bounds check remaining in the generated assembly, which were removed by the compiler on x86_64 and also in the simplified version I checked in compiler explorer. Adding some assume fixes that and the benchmarks on c7g / Neoverse-V1 are now

Before

benchmarks:
    string::bench_to_lowercase 61.00ns/iter +/- 0.00ns

After

benchmarks:
    string::bench_to_lowercase 28.00ns/iter +/- 0.00ns

May 09 '24 12:05 jhorstmann

:umbrella: The latest upstream changes (presumably #124773) made this pull request unmergeable. Please resolve the merge conflicts.

May 09 '24 19:05 bors

@the8472 can you take a look at this PR? It might have been lost in the review queue.

May 13 '24 07:05 jhorstmann

A couple of thoughts:

A similar thing is done also here: https://github.com/rust-lang/rust/blob/abb95639ef2b837dbfe7b5d18f51fadda29711cb/library/core/src/slice/ascii.rs#L328-L341 maybe would be nice to somehow merge the two?
Would be interesting to understand why the compiler doesn't use pmovmskb if not writing the code in this specific way (eg. using & 0x8080808080808080)

May 13 '24 15:05 Marcondiro

Updated benchmark results:

master (#eda9d7f987de76b9d61c633a6ac328936e1b94f0)
benchmarks:
    str::to_lowercase::long_lorem_ipsum  534.48/iter +/- 33.86
    str::to_lowercase::short_ascii        20.11/iter  +/- 0.53
    str::to_lowercase::short_mixed       240.21/iter  +/- 3.40
    str::to_lowercase::short_pile_of_poo 262.54/iter  +/- 8.33

PR (#b03d93962816fd82afb619e0cf2083dc67e218e8)
benchmarks:
    str::to_lowercase::long_lorem_ipsum  148.49/iter +/- 1.24
    str::to_lowercase::short_ascii        12.14/iter +/- 0.13
    str::to_lowercase::short_mixed       240.31/iter +/- 3.51
    str::to_lowercase::short_pile_of_poo 259.42/iter +/- 7.45

Benchmark results in the initial PR comments were using a different input, these are using the same input strings as several existing benchmarks. I did not get around yet to rerunning on aarch64.

Update: Above results were with -Ctarget-cpu=native, with default target the improvements are bigger:

master
    str::to_lowercase::long_lorem_ipsum  1027.63/iter +/- 39.76
    str::to_lowercase::short_ascii         34.39/iter  +/- 0.73
    str::to_lowercase::short_mixed        262.95/iter +/- 16.07
    str::to_lowercase::short_pile_of_poo  261.71/iter +/- 15.26

PR
    str::to_lowercase::long_lorem_ipsum  175.39/iter  +/- 3.18
    str::to_lowercase::short_ascii        12.25/iter  +/- 0.30
    str::to_lowercase::short_mixed       237.15/iter +/- 11.36
    str::to_lowercase::short_pile_of_poo 262.57/iter  +/- 8.37

Jun 02 '24 20:06 jhorstmann

Benchmark results on aarch64 (Apple M1)

master:
    str::to_lowercase::long_lorem_ipsum  523.29/iter  +/- 3.10
    str::to_lowercase::short_ascii        33.31/iter  +/- 0.29
    str::to_lowercase::short_mixed       299.64/iter +/- 23.58
    str::to_lowercase::short_pile_of_poo 295.03/iter  +/- 9.97

PR:
    str::to_lowercase::long_lorem_ipsum  129.03/iter  +/- 2.37
    str::to_lowercase::short_ascii        23.27/iter  +/- 0.49
    str::to_lowercase::short_mixed       271.17/iter +/- 31.33
    str::to_lowercase::short_pile_of_poo 272.03/iter  +/- 3.60

Great results even without pmovmskb 👍

Jun 04 '24 09:06 Marcondiro

@bors r+

Jun 25 '24 22:06 the8472

:pushpin: Commit b5ea85f6949d04e9b47f075719a76c55ab9584b9 has been approved by the8472

It is now in the queue for this repository.

Jun 25 '24 22:06 bors

:hourglass: Testing commit b5ea85f6949d04e9b47f075719a76c55ab9584b9 with merge 98a3a2c7b688228ce7968f6c7d72b9c50724844a...

Jun 27 '24 00:06 bors

The job i686-gnu failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

failures:

---- [codegen] tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs stdout ----

error: verification with 'FileCheck' failed
status: exit status: 1
command: "/checkout/obj/build/i686-unknown-linux-gnu/llvm/build/bin/FileCheck" "--input-file" "/checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll" "/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs" "--check-prefix=CHECK" "--check-prefix" "NONMSVC" "--allow-unused-prefixes" "--dump-input-context" "100"
--- stderr -------------------------------
/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs:8:11: error: CHECK: expected string not found in input
/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs:8:11: error: CHECK: expected string not found in input
// CHECK: [[A:%[0-9]]] = load <16 x i8>
          ^
/checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:10:38: note: scanning from here
define noundef i32 @lower_while_ascii(ptr noalias nocapture noundef nonnull readonly align 1 %0, i32 noundef %1, ptr noalias nocapture noundef nonnull writeonly align 1 %2, i32 noundef %3) unnamed_addr #0 personality ptr @rust_eh_personality {
                                     ^
/checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:33:2: note: possible intended match here
 %5 = load <8 x i8>, ptr %input.sroa.0.021, align 1

Input file: /checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll
Check file: /checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs


-dump-input=help explains the following input dump.
Input was:
<<<<<<
<<<<<<
           1: ; ModuleID = 'issue_123712_str_to_lower_autovectorization.d6ee0dc286ccc26c-cgu.0' 
           2: source_filename = "issue_123712_str_to_lower_autovectorization.d6ee0dc286ccc26c-cgu.0" 
           3: target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-i128:128-f64:32:64-f80:32-n8:16:32-S128" 
           4: target triple = "i686-unknown-linux-gnu" 
           5:  
           6: @alloc_9aa1bc107ccd5c4e0fb094a54d1f6c40 = private unnamed_addr constant <{ [77 x i8] }> <{ [77 x i8] c"/checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs" }>, align 1 
           7: @alloc_e04a128e070b1881e32fb6e8157ed5f8 = private unnamed_addr constant <{ ptr, [12 x i8] }> <{ ptr @alloc_9aa1bc107ccd5c4e0fb094a54d1f6c40, [12 x i8] c"M\00\00\00\12\00\00\00\19\00\00\00" }>, align 4 
           9: ; Function Attrs: uwtable 
           9: ; Function Attrs: uwtable 
          10: define noundef i32 @lower_while_ascii(ptr noalias nocapture noundef nonnull readonly align 1 %0, i32 noundef %1, ptr noalias nocapture noundef nonnull writeonly align 1 %2, i32 noundef %3) unnamed_addr #0 personality ptr @rust_eh_personality { 
check:8'0                                          X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
check:8'0     ~~~~~~~
check:8'0     ~~~~~~~
          12:  %_8.i = icmp ugt i32 %1, %3 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          13:  br i1 %_8.i, label %bb1.i, label %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit" 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          15: bb1.i: ; preds = %start 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~
          16: ; call core::slice::index::slice_end_index_len_fail 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          17:  tail call void @_ZN4core5slice5index24slice_end_index_len_fail17hac67c3d99c158e18E(i32 noundef %1, i32 noundef %3, ptr noalias noundef nonnull readonly align 4 dereferenceable(16) @alloc_e04a128e070b1881e32fb6e8157ed5f8) #3, !noalias !2 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          18:  unreachable 
check:8'0     ~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          20: "_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit": ; preds = %start 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          21:  %_917 = icmp ugt i32 %1, 7 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          22:  br i1 %_917, label %bb3.preheader.preheader, label %bb12 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          24: bb3.preheader.preheader: ; preds = %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit" 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          25:  %4 = and i32 %1, -8 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~
          26:  br label %bb3.preheader 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          28: bb3.preheader: ; preds = %bb3.preheader.preheader, %bb18.preheader 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          29:  %input.sroa.0.021 = phi ptr [ %_63, %bb18.preheader ], [ %0, %bb3.preheader.preheader ] 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          30:  %input.sroa.6.020 = phi i32 [ %_62, %bb18.preheader ], [ %1, %bb3.preheader.preheader ] 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          31:  %output.sroa.0.019 = phi ptr [ %_76, %bb18.preheader ], [ %2, %bb3.preheader.preheader ] 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          32:  %ascii_prefix_len.sroa.0.018 = phi i32 [ %12, %bb18.preheader ], [ 0, %bb3.preheader.preheader ] 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          33:  %5 = load <8 x i8>, ptr %input.sroa.0.021, align 1 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'1      ?                                                   possible intended match
          34:  %6 = icmp slt <8 x i8> %5, zeroinitializer 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          35:  %7 = bitcast <8 x i1> %6 to i8 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          36:  %_20.not = icmp eq i8 %7, 0 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          37:  br i1 %_20.not, label %bb18.preheader, label %bb12 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          39: bb12: ; preds = %bb18.preheader, %bb3.preheader, %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit" 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          40:  %ascii_prefix_len.sroa.0.0.lcssa = phi i32 [ 0, %"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E.exit" ], [ %ascii_prefix_len.sroa.0.018, %bb3.preheader ], [ %4, %bb18.preheader ] 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          41:  ret i32 %ascii_prefix_len.sroa.0.0.lcssa 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          43: bb18.preheader: ; preds = %bb3.preheader 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          44:  %8 = add <8 x i8> %5, <i8 -65, i8 -65, i8 -65, i8 -65, i8 -65, i8 -65, i8 -65, i8 -65> 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          45:  %9 = icmp ult <8 x i8> %8, <i8 26, i8 26, i8 26, i8 26, i8 26, i8 26, i8 26, i8 26> 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          46:  %10 = select <8 x i1> %9, <8 x i8> <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>, <8 x i8> zeroinitializer 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          47:  %11 = or <8 x i8> %10, %5 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~
          48:  store <8 x i8> %11, ptr %output.sroa.0.019, align 1 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          49:  %12 = add i32 %ascii_prefix_len.sroa.0.018, 8 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          50:  %_62 = add i32 %input.sroa.6.020, -8 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          51:  %_63 = getelementptr inbounds i8, ptr %input.sroa.0.021, i32 8 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          52:  %_76 = getelementptr inbounds i8, ptr %output.sroa.0.019, i32 8 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          53:  %_9 = icmp ugt i32 %_62, 7 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          54:  br i1 %_9, label %bb3.preheader, label %bb12 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          55: } 
check:8'0     ~~
check:8'0     ~
          57: ; core::slice::index::slice_end_index_len_fail 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          58: ; Function Attrs: cold noinline noreturn uwtable 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          59: declare void @_ZN4core5slice5index24slice_end_index_len_fail17hac67c3d99c158e18E(i32 noundef, i32 noundef, ptr noalias noundef readonly align 4 dereferenceable(16)) unnamed_addr #1 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          61: ; Function Attrs: nounwind uwtable 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          62: declare noundef i32 @rust_eh_personality(i32 noundef, i32 noundef, i64 noundef, ptr noundef, ptr noundef) unnamed_addr #2 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          64: attributes #0 = { uwtable "probe-stack"="inline-asm" "target-cpu"="pentium4" } 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          65: attributes #1 = { cold noinline noreturn uwtable "probe-stack"="inline-asm" "target-cpu"="pentium4" } 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          66: attributes #2 = { nounwind uwtable "probe-stack"="inline-asm" "target-cpu"="pentium4" } 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          67: attributes #3 = { noreturn } 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          69: !llvm.module.flags = !{!0} 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~
          70: !llvm.ident = !{!1} 
check:8'0     ~~~~~~~~~~~~~~~~~~~~
check:8'0     ~
check:8'0     ~
          72: !0 = !{i32 8, !"PIC Level", i32 2} 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          73: !1 = !{!"rustc version 1.81.0-nightly (98a3a2c7b 2024-06-27)"} 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          74: !2 = !{!3} 
check:8'0     ~~~~~~~~~~~
          75: !3 = distinct !{!3, !4, !"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E: %slice.0"} 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          76: !4 = distinct !{!4, !"_ZN106_$LT$core..ops..range..Range$LT$usize$GT$$u20$as$u20$core..slice..index..SliceIndex$LT$$u5b$T$u5d$$GT$$GT$9index_mut17h3804f3258f871e31E"} 
check:8'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------

Jun 27 '24 01:06 rust-log-analyzer

:broken_heart: Test failed - checks-actions

Jun 27 '24 01:06 bors

The loop depends on the size of usize and i686-gnu is a 32-bit target. It supports sse2 though. So I'm not sure what the best fix would be, run the codegen test only for 64-bit targets, or make the code independent of the word size and always process 16 bytes.

2024-06-27T01:17:05.4455854Z /checkout/tests/codegen/issues/issue-123712-str-to-lower-autovectorization.rs:8:11: error: CHECK: expected string not found in input
2024-06-27T01:17:05.4456805Z // CHECK: [[A:%[0-9]]] = load <16 x i8>
2024-06-27T01:17:05.4457184Z           ^
2024-06-27T01:17:05.4458416Z /checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:10:38: note: scanning from here
2024-06-27T01:17:05.4460401Z define noundef i32 @lower_while_ascii(ptr noalias nocapture noundef nonnull readonly align 1 %0, i32 noundef %1, ptr noalias nocapture noundef nonnull writeonly align 1 %2, i32 noundef %3) unnamed_addr #0 personality ptr @rust_eh_personality {
2024-06-27T01:17:05.4461733Z                                      ^
2024-06-27T01:17:05.4463122Z /checkout/obj/build/i686-unknown-linux-gnu/test/codegen/issues/issue-123712-str-to-lower-autovectorization/issue-123712-str-to-lower-autovectorization.ll:33:2: note: possible intended match here
2024-06-27T01:17:05.4464356Z  %5 = load <8 x i8>, ptr %input.sroa.0.021, align 1

Jun 27 '24 09:06 jhorstmann

Sizing it for SSE registers is probably fine. Afaik we don't support anything with 64bit-wide SIMD units and MMX is practically dead.

Jun 27 '24 22:06 the8472

Thanks, just pushed a commit doing that. Let me know if I should squash or rebase.

I haven't succeeded yet in setting up cross-compiling to i686 locally, will have to rely on another CI run.

Jun 28 '24 08:06 jhorstmann

Oh, I thought you meant changing the implementation itself to always use 128bit wide chunks.

Some other ascii code paths use usize because they do align_to to process multiple u8s as usizes which also works as SWAR if SIMD units aren't available. But that doesn't apply here since we have input and output slices which are guaranteed to align the same way.

I haven't succeeded yet in setting up cross-compiling to i686 locally, will have to rely on another CI run.

you can use the docker scripts to execute CI jobs locally

Jun 28 '24 08:06 the8472

Oh, I thought you meant changing the implementation itself to always use 128bit wide chunks.

I can also do that, assembly for i686 looks good also with 128bit chunks. 32-bit arm targets to not look too good either way, as far as I can tell.

Jun 28 '24 17:06 jhorstmann

About SWAR and 32bit ARM, did you also compare it against the code on master?

Jun 28 '24 17:06 the8472

Comparing with target thumbv7neon-linux-androideabi in compiler explorer I'd say the PR with a fixed chunk size of 16 would be a win. Both do not seem to use neon very effectively. The amount of instructions is about the same, but master should only process 8 bytes per loop iteration. Without neon, PR+chunk size 16 generates a bit over twice the instructions than master, so instructions per byte should be similar.

Before updating the PR to a fixed chunk size of 16, should we also look at other targets, or trust llvm to generate good enough code? I'm even less familiar with riscv or other assembly. And I doubt the original code got so much scrutiny ;)

Jun 28 '24 21:06 jhorstmann

Without neon, PR+chunk size 16 generates a bit over twice the instructions than master, so instructions per byte should be similar.

That's with O3 which unrolls a lot. With Os the size is about the same. So I hope the embedded folks won't complain.

Good enough for now.

Jun 28 '24 22:06 the8472

Thanks, updated now to a fixed chunk size. If the code size is an issue for some embedded platform, the chunked logic could be put behind the optimize_for_size feature.

Jun 29 '24 13:06 jhorstmann

@bors r+

Jun 29 '24 19:06 the8472

:pushpin: Commit 623b9c8fb3b0704b42f66ec87c75a6d3fc15f8d6 has been approved by the8472

It is now in the queue for this repository.

Jun 29 '24 19:06 bors

@bors r- rollup=iffy failed in https://github.com/rust-lang/rust/pull/127143#issuecomment-2198436516

Jun 30 '24 06:06 matthiaskrgr

Improve autovectorization of to_lowercase / to_uppercase functions