runtime icon indicating copy to clipboard operation
runtime copied to clipboard

Arm64: Add If conversion pass

Open a74nh opened this issue 3 years ago • 21 comments

This patch builds upon the work in https://github.com/dotnet/runtime/pull/71705

A single condition can be optimised by replacing a JTRUE with a SELECT and removing the control flow.

For example:

if (op1 > 0) { op1 = 5; }
IN0001: 000008                    mov     w2, #5
IN0002: 00000C                    cmp     w0, #0
IN0003: 000010                    csel    w0, w0, w2, eq

Additional JTRUE statements can be replaced by placing AND chains onto the conditional part of a SELECT.

For example:

if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
IN0001: 000008                    cmp     w2, #7
IN0002: 00000C                    ccmp    w1, #10, z, lt
IN0003: 000010                    ccmp    w0, #3, nzc, ne
IN0004: 000014                    mov     w2, #5
IN0005: 000018                    csel    w0, w2, w0, gt

Performance tests are here: https://github.com/dotnet/performance/pull/2517

a74nh avatar Aug 05 '22 16:08 a74nh

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

This patch builds upon the work in https://github.com/dotnet/runtime/pull/71705

A single condition can be optimised by replacing a JTRUE with a SELECT and removing the control flow.

For example:

if (op1 > 0) { op1 = 5; }
IN0001: 000008                    mov     w2, dotnet/perf-autofiling-issues#5
IN0002: 00000C                    cmp     w0, #0
IN0003: 000010                    csel    w0, w0, w2, eq

Additional JTRUE statements can be replaced by placing AND chains onto the conditional part of a SELECT.

For example:

if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
IN0001: 000008                    cmp     w2, dotnet/runtime#56402
IN0002: 00000C                    ccmp    w1, dotnet/perf-autofiling-issues#10, z, lt
IN0003: 000010                    ccmp    w0, dotnet/perf-autofiling-issues#3, nzc, ne
IN0004: 000014                    mov     w2, dotnet/perf-autofiling-issues#5
IN0005: 000018                    csel    w0, w2, w0, gt

Performance tests are here: https://github.com/dotnet/performance/pull/2517

Author: a74nh
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

ghost avatar Aug 05 '22 16:08 ghost

The first commit in this PR is a copy of https://github.com/dotnet/runtime/pull/71705. Once that PR is merged I will remove that commit from this patch. This was just so I could get this out a little earlier.

@kunalspathak @jakobbotsch

Also, this probably needs an additional check in the optimizer pass to limit the max size of the chains.

a74nh avatar Aug 05 '22 16:08 a74nh

PR is now rebased on top of HEAD, and is good for a review. Taking this out of draft.

I do have the one failure being hit a few times in the testsuite - debugging this now. However, I'm not expecting that to make big changes to the patch.

I'm away for two weeks from Monday. Getting this in before then feels unlikely. However, it'd be good to know how close it is.

@kunalspathak @jakobbotsch

a74nh avatar Aug 10 '22 11:08 a74nh

The diffs look like size wise regressions, why is that? I would expect this pattern should give smaller code in most cases.

For all the cases where the code is longer that will be due to no longer using cbz/cbnz instructions:

cbnz x0, LABEL1
mov x1, 5
LABEL1:

now becomes:

cmp x0, 0
mov x2, 5
csel x1, x2, x1, eq

Which is one instruction longer. Assuming the branch taken/not taken is fairly random, then the new code will be faster. I should probably extend https://github.com/dotnet/performance/pull/2517 to make sure those cases are being tested.

a74nh avatar Aug 11 '22 08:08 a74nh

Assuming the branch taken/not taken is fairly random, then the new code will be faster.

I'm not sure this is a fair assumption in practice. Can you post some micro benchmarks for all cases (both randomly taken branches and biased branches)?

jakobbotsch avatar Aug 15 '22 08:08 jakobbotsch

/azp run Fuzzlyn

jakobbotsch avatar Aug 15 '22 10:08 jakobbotsch

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Aug 15 '22 10:08 azure-pipelines[bot]

As a side point - when rebasing is it best to pull/rebase/forcepush or is a merge commit preferred?

a74nh avatar Sep 02 '22 11:09 a74nh

As a side point - when rebasing is it best to pull/rebase/forcepush or is a merge commit preferred?

Merge commits are preferred (essentially, it is easiest for reviewers if you never force push a non-draft PR).

jakobbotsch avatar Sep 02 '22 12:09 jakobbotsch

Assuming the branch taken/not taken is fairly random, then the new code will be faster.

I'm not sure this is a fair assumption in practice. Can you post some micro benchmarks for all cases (both randomly taken branches and biased branches)?

I expanded the test Single() in the If Statement benchmarks that I previously added to the performance repo.

https://github.com/dotnet/performance/blob/5f63e8dcf81a0f9ddf5a80bd791d1fbaf0babd61/src/benchmarks/micro/runtime/Statements/If.cs#L46

        [Benchmark]
        public void Single() {
            for (int i = 0; i < Iterations; i++) {
                SingleInner(inputs[i]);
            }
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        public static void SingleInner(int op1) {
            if (op1 % 2 == 0) {
                op1 = 5;
            }
            Consume(op1, 0, 0, 0);
        }

(I'll post a link to the patch later)

For this first set: Input is random data. Mod 2 in the first test gives roughly 50/50 split. Each following test increases the mod, giving more weight to jumping over the condition. Eg: Single(): if (op1 % 2 == 0) { op1 = 5; } Single3(): if (op1 % 3 == 0) { op1 = 5; } Single4(): if (op1 % 4 == 0) { op1 = 5; } etc

Performance for the ~50/50 is 46% faster, and then drops off as the taken/not taken moves away from 50/50. But even with mod6 it's still 24% faster.

|         Method |        Job |      Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) |
|--------------- |----------- |----------------------------------------------------------------------------------------------------------- |
|         Single | Job-BEWOVY | /runtime_HEAD/ | 62.27 us | 0.034 us | 0.028 us | 62.28 us | 62.22 us | 62.31 us |  1.00 |            Base |
|         Single | Job-EDSOJO | /runtime_csel/ | 33.65 us | 0.110 us | 0.097 us | 33.65 us | 33.48 us | 33.82 us |  0.54 |          Faster |
|                |            |                                                                                                            |
|        Single3 | Job-BEWOVY | /runtime_HEAD/ | 65.84 us | 0.069 us | 0.061 us | 65.82 us | 65.77 us | 65.96 us |  1.00 |            Base |
|        Single3 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.032 us | 0.026 us | 39.41 us | 39.39 us | 39.48 us |  0.60 |          Faster |
|                |            |                                                                                                            |
|        Single4 | Job-BEWOVY | /runtime_HEAD/ | 52.70 us | 0.011 us | 0.009 us | 52.70 us | 52.69 us | 52.72 us |  1.00 |            Base |
|        Single4 | Job-EDSOJO | /runtime_csel/ | 33.69 us | 0.168 us | 0.140 us | 33.75 us | 33.40 us | 33.84 us |  0.64 |          Faster |
|                |            |                                                                                                            |
|        Single5 | Job-BEWOVY | /runtime_HEAD/ | 56.68 us | 0.057 us | 0.050 us | 56.70 us | 56.58 us | 56.72 us |  1.00 |            Base |
|        Single5 | Job-EDSOJO | /runtime_csel/ | 39.75 us | 0.014 us | 0.013 us | 39.75 us | 39.73 us | 39.77 us |  0.70 |          Faster |
|                |            |                                                                                                            |
|        Single6 | Job-BEWOVY | /runtime_HEAD/ | 51.56 us | 0.025 us | 0.022 us | 51.57 us | 51.50 us | 51.58 us |  1.00 |            Base |
|        Single6 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.027 us | 0.023 us | 39.41 us | 39.37 us | 39.45 us |  0.76 |          Faster |

For this set, I reverse the functions, so that the jump is increasingly likely to happen. Eg: Single(): if (op1 % 2 != 0) { op1 = 5; } SingleRev3(): if (op1 % 3 != 0) { op1 = 5; } etc As expected, results are the same as above.

|                |            |                                                                                                            |
|      SingleRev | Job-BEWOVY | /runtime_HEAD/ | 62.70 us | 0.048 us | 0.040 us | 62.69 us | 62.65 us | 62.77 us |  1.00 |            Base |
|      SingleRev | Job-EDSOJO | /runtime_csel/ | 33.62 us | 0.039 us | 0.032 us | 33.64 us | 33.58 us | 33.67 us |  0.54 |          Faster |
|                |            |                                                                                                            |
|     SingleRev3 | Job-BEWOVY | /runtime_HEAD/ | 64.90 us | 0.023 us | 0.019 us | 64.90 us | 64.86 us | 64.93 us |  1.00 |            Base |
|     SingleRev3 | Job-EDSOJO | /runtime_csel/ | 39.38 us | 0.028 us | 0.023 us | 39.38 us | 39.34 us | 39.42 us |  0.61 |          Faster |
|                |            |                                                                                                            |
|     SingleRev4 | Job-BEWOVY | /runtime_HEAD/ | 51.89 us | 0.192 us | 0.171 us | 51.80 us | 51.75 us | 52.18 us |  1.00 |            Base |
|     SingleRev4 | Job-EDSOJO | /runtime_csel/ | 33.64 us | 0.195 us | 0.173 us | 33.74 us | 33.40 us | 33.82 us |  0.65 |          Faster |
|                |            |                                                                                                            |
|     SingleRev5 | Job-BEWOVY | /runtime_HEAD/ | 57.31 us | 0.020 us | 0.018 us | 57.31 us | 57.28 us | 57.34 us |  1.00 |            Base |
|     SingleRev5 | Job-EDSOJO | /runtime_csel/ | 40.00 us | 0.530 us | 0.442 us | 39.75 us | 39.68 us | 40.65 us |  0.70 |          Faster |
|                |            |                                                                                                            |
|     SingleRev6 | Job-BEWOVY | /runtime_HEAD/ | 51.72 us | 0.027 us | 0.024 us | 51.71 us | 51.69 us | 51.76 us |  1.00 |            Base |
|     SingleRev6 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.025 us | 0.021 us | 39.42 us | 39.39 us | 39.46 us |  0.76 |          Faster |

For the last two sets, I use exactly the same functions as above, but the input data is sequential instead of random. SingleSeq will alternate exactly 50/50: pass, fail, pass, fail... SingleSeq3 will pass, fail, fail, pass, fail, fail... SingleSeq4 will pass, fail, fail, fail, pass, fail, fail fail... etc

Here we see performance gets worse, but only slightly - worse case is 6% slower. Oddly, one of the cases is faster.

|      SingleSeq | Job-BEWOVY | /runtime_HEAD/ | 33.42 us | 0.066 us | 0.055 us | 33.41 us | 33.37 us | 33.56 us |  1.00 |            Base |
|      SingleSeq | Job-EDSOJO | /runtime_csel/ | 33.62 us | 0.116 us | 0.103 us | 33.67 us | 33.39 us | 33.71 us |  1.01 |            Same |
|                |            |                                                                                                            |
|     SingleSeq3 | Job-BEWOVY | /runtime_HEAD/ | 37.51 us | 0.020 us | 0.018 us | 37.51 us | 37.48 us | 37.54 us |  1.00 |            Base |
|     SingleSeq3 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.015 us | 0.014 us | 39.42 us | 39.40 us | 39.45 us |  1.05 |          Slower |
|                |            |                                                                                                            |
|     SingleSeq4 | Job-BEWOVY | /runtime_HEAD/ | 35.04 us | 0.004 us | 0.003 us | 35.04 us | 35.03 us | 35.04 us |  1.00 |            Base |
|     SingleSeq4 | Job-EDSOJO | /runtime_csel/ | 33.65 us | 0.124 us | 0.110 us | 33.69 us | 33.38 us | 33.73 us |  0.96 |          Faster |
|                |            |                                                                                                            |
|     SingleSeq5 | Job-BEWOVY | /runtime_HEAD/ | 38.71 us | 0.258 us | 0.229 us | 38.88 us | 38.41 us | 38.93 us |  1.00 |            Base |
|     SingleSeq5 | Job-EDSOJO | /runtime_csel/ | 39.68 us | 0.021 us | 0.018 us | 39.69 us | 39.65 us | 39.71 us |  1.03 |            Same |
|                |            |                                                                                                            |
|     SingleSeq6 | Job-BEWOVY | /runtime_HEAD/ | 37.26 us | 0.017 us | 0.013 us | 37.25 us | 37.24 us | 37.29 us |  1.00 |            Base |
|     SingleSeq6 | Job-EDSOJO | /runtime_csel/ | 39.43 us | 0.020 us | 0.017 us | 39.43 us | 39.40 us | 39.47 us |  1.06 |          Slower |
|   SingleSeqRev | Job-BEWOVY | /runtime_HEAD/ | 33.51 us | 0.204 us | 0.170 us | 33.42 us | 33.37 us | 33.82 us |  1.00 |            Base |
|   SingleSeqRev | Job-EDSOJO | /runtime_csel/ | 33.58 us | 0.124 us | 0.110 us | 33.63 us | 33.38 us | 33.67 us |  1.00 |            Same |
|                |            |                                                                                                            |
|  SingleSeqRev3 | Job-BEWOVY | /runtime_HEAD/ | 38.10 us | 0.031 us | 0.026 us | 38.10 us | 38.04 us | 38.14 us |  1.00 |            Base |
|  SingleSeqRev3 | Job-EDSOJO | /runtime_csel/ | 39.34 us | 0.024 us | 0.021 us | 39.34 us | 39.31 us | 39.38 us |  1.03 |          Slower |
|                |            |                                                                                                            |
|  SingleSeqRev4 | Job-BEWOVY | /runtime_HEAD/ | 33.36 us | 0.003 us | 0.002 us | 33.36 us | 33.36 us | 33.37 us |  1.00 |            Base |
|  SingleSeqRev4 | Job-EDSOJO | /runtime_csel/ | 33.60 us | 0.170 us | 0.151 us | 33.68 us | 33.37 us | 33.73 us |  1.01 |            Same |
|                |            |                                                                                                            |
|  SingleSeqRev5 | Job-BEWOVY | /runtime_HEAD/ | 39.04 us | 0.053 us | 0.050 us | 39.06 us | 38.93 us | 39.09 us |  1.00 |            Base |
|  SingleSeqRev5 | Job-EDSOJO | /runtime_csel/ | 39.68 us | 0.017 us | 0.015 us | 39.68 us | 39.65 us | 39.71 us |  1.02 |            Same |
|                |            |                                                                                                            |
|  SingleSeqRev6 | Job-BEWOVY | /runtime_HEAD/ | 37.91 us | 0.030 us | 0.026 us | 37.91 us | 37.87 us | 37.96 us |  1.00 |            Base |
|  SingleSeqRev6 | Job-EDSOJO | /runtime_csel/ | 39.37 us | 0.029 us | 0.026 us | 39.37 us | 39.33 us | 39.43 us |  1.04 |          Slower |

These are all on an Altra. With older Arm hardware the performance gains get smaller as the hardware gets older.

a74nh avatar Sep 05 '22 13:09 a74nh

What does it look like for always taken/always not taken branches?

How does it affect Queue<T>.Enqueue/Queue<T>.Dequeue in a loop? It uses this function: https://github.com/dotnet/runtime/blob/8fe12bc5a66a5ef42200a3d88d68b5991eb158e6/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Queue.cs#L344-L356 Does the change get this case?

And thanks for posting the benchmarks 🙂

jakobbotsch avatar Sep 05 '22 13:09 jakobbotsch

What does it look like for always taken/always not taken branches?

Not so great.... For both of these it's mod2 with inputs of all zeros. SingleSeqAlways - Always passes the condition. 7% slower. SingleSeqAlwaysNever - Never passes the condition, 12% slower

|               Method |        Job |      Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) |
|--------------------- |----------- |----------------------------------------------------------------------------------------------------------- |
|      SingleSeqAlways | Job-QIHMDO | /runtime_HEAD/ | 31.38 us | 0.063 us | 0.056 us | 31.38 us | 31.30 us | 31.48 us |  1.00 |            Base |
|      SingleSeqAlways | Job-SMJRRR | /runtime_csel/ | 33.60 us | 0.037 us | 0.031 us | 33.61 us | 33.55 us | 33.66 us |  1.07 |          Slower |
|                      |            |                                                                                                            |
| SingleSeqAlwaysNever | Job-QIHMDO | /runtime_HEAD/ | 30.10 us | 0.060 us | 0.047 us | 30.11 us | 30.05 us | 30.18 us |  1.00 |            Base |
| SingleSeqAlwaysNever | Job-SMJRRR | /runtime_csel/ | 33.58 us | 0.127 us | 0.106 us | 33.61 us | 33.40 us | 33.71 us |  1.12 |          Slower |

How does it affect Queue<T>.Enqueue/Queue<T>.Dequeue in a loop? It uses this function:

https://github.com/dotnet/runtime/blob/8fe12bc5a66a5ef42200a3d88d68b5991eb158e6/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Queue.cs#L344-L356

Does the change get this case? And thanks for posting the benchmarks 🙂

I extracted this into a test case and it fails to If Convert:

***** BB01
STMT00001 ( 0x005[E-] ... 0x00E )
N008 ( 10,  9) [000011] ---XG------                         *  JTRUE     void   <l:$1ca, c:$1cb>
N007 (  8,  7) [000010] N--XG--N-U-                         \--*  NE        int    <l:$20b, c:$20c>
N005 (  6,  5) [000009] ---XG------                            +--*  ARR_LENGTH int    <l:$207, c:$208>
N004 (  4,  3) [000008] ---XG------                            |  \--*  IND       ref    <l:$1c4, c:$1c5>
N003 (  3,  4) [000021] -------N---                            |     \--*  ADD       byref  $280
N001 (  1,  1) [000007] -----------                            |        +--*  LCL_VAR   ref    V00 this         u:1 (last use) $80
N002 (  1,  2) [000020] -----------                            |        \--*  CNS_INT   long   8 Fseq[_array] $240
N006 (  1,  1) [000006] -----------                            \--*  LCL_VAR   int    V02 loc0         u:2 <l:$202, c:$203>

The NE fails on a flags GTF_SIDE_EFFECT not set check - as both GTF_EXCEPT and GTF_GLOB_REF are set. As I understood it, it wouldn't be safe to move that? It's a shame if it can't as that will stop all member data being used.

a74nh avatar Sep 05 '22 14:09 a74nh

As I understood it, it wouldn't be safe to move that?

It would be fine to do the transformation regardless of side effects on the condition. It's what I pointed out in https://github.com/dotnet/runtime/pull/73472#discussion_r942474901. (Edit: I suppose this is true only after the and-chains have been removed.)

jakobbotsch avatar Sep 05 '22 14:09 jakobbotsch

Not so great.... For both of these it's mod2 with inputs of all zeros. SingleSeqAlways - Always passes the condition. 7% slower. SingleSeqAlwaysNever - Never passes the condition, 12% slower

It's a bit worrying indeed. We probably need some heuristics and perhaps see if we can use PGO data, though at this late stage in the JIT the PGO data might not be all that reliable.

jakobbotsch avatar Sep 05 '22 14:09 jakobbotsch

As I understood it, it wouldn't be safe to move that?

It would be fine to do the transformation regardless of side effects on the condition. It's what I pointed out in #73472 (comment). (Edit: I suppose this is true only after the and-chains have been removed.)

Ahh, I missed that part.

I suppose this is true only after the and-chains have been removed

Are you suggesting that this check can't be removed until the CCMP patch has also been done?

Regardless, I tried this out....

Creating a quick benchmark:

        [Benchmark]
        public void MoveNext() {
            int index = 0;
            for (int i = 0; i < Iterations; i++) {
                MoveNextInner(ref index);
            }
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private void MoveNextInner(ref int index)
        {
            // It is tempting to use the remainder operator here but it is actually much slower
            // than a simple comparison and a rarely taken branch.
            // JIT produces better code than with ternary operator ?:
            int tmp = index + 1;
            if (tmp == _array.Length)
            {
                tmp = 0;
            }
            index = tmp;
        }

With 10,000 itterations and array size of 5, gives us 12% speedup:

|   Method |        Job |      Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|--------- |----------- |----------------|---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| MoveNext | Job-OVNAZN | /runtime_HEAD/ | 26.29 us | 0.001 us | 0.001 us | 26.29 us | 26.28 us | 26.29 us |  1.00 |            Base |         - |          NA |
| MoveNext | Job-STXUQS | /runtime_csel/ | 23.03 us | 0.001 us | 0.001 us | 23.03 us | 23.03 us | 23.03 us |  0.88 |          Faster |         - |          NA |

I increased array size to 300, and it still does better than I expected, at 9% better:

|   Method |        Job |      Toolchain |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|--------- |----------- |--------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| MoveNext | Job-LENSWQ | /runtime_HEAD/ | 26.29 us | 0.002 us | 0.002 us | 26.29 us | 26.28 us | 26.29 us |  1.00 |            Base |         - |          NA |
| MoveNext | Job-CZJSLJ | /runtime_csel/ | 23.96 us | 0.006 us | 0.005 us | 23.96 us | 23.95 us | 23.97 us |  0.91 |          Faster |         - |          NA |

a74nh avatar Sep 05 '22 15:09 a74nh

As I understood it, it wouldn't be safe to move that?

It would be fine to do the transformation regardless of side effects on the condition. It's what I pointed out in #73472 (comment). (Edit: I suppose this is true only after the and-chains have been removed.)

Ahh, I missed that part.

I suppose this is true only after the and-chains have been removed

Are you suggesting that this check can't be removed until the CCMP patch has also been done?

Actually, I think it's safe for this PR. The transformation is only ever squashing together a JTRUE and an assignment directly connected to it. It doesn't matter if doing the check causes a side effect, because the condition always had to be checked.

Once we get to AND chains then the AND chain can't have side conditions inside the chain (due to always having to evaluate everything in the chain).

a74nh avatar Sep 05 '22 16:09 a74nh

Are you suggesting that this check can't be removed until the CCMP patch has also been done?

No, it should be fine now. I was not sure if this check used to be guarding just the if-conversion or also the conversion of short-circuiting AND to bitwise AND.

jakobbotsch avatar Sep 05 '22 16:09 jakobbotsch

The new benchmarks look good. I am also a little surprised that this is beneficial even at 300 items. What is the diff of the code? (Edit: I see now this is a micro benchmark of just this code. Can you show the diff for Queue<int>.Enqueue/Dequeue where this function is going to be inlined into?)

jakobbotsch avatar Sep 05 '22 16:09 jakobbotsch

What is the diff of the code?

This is my full set of benchmark changes: https://github.com/a74nh/performance/commit/b1c00bafdcb19edcc12640b8e3e2f16752277e80 (Delay sharing was just internal process on my end that we need to go through before sharing)

Queue.Enqueue/Dequeue

Perf_PriorityQueue already has some tests for this. However my results aren't stable - I'm getting some large errors in the mix. Here's a few different runs:



|              Method |        Job |      Toolchain | Size |         Mean |       Error |      StdDev |       Median |          Min |          Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|------------:|------------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-VTKOWH | /runtime_HEAD/ |   10 |     916.2 ns |     4.66 ns |     4.35 ns |     915.6 ns |     909.2 ns |     923.6 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-ZDEPQS | /runtime_csel/ |   10 |     910.6 ns |     4.65 ns |     4.35 ns |     910.7 ns |     902.2 ns |     917.9 ns |  0.99 |            Same |         - |          NA |
|                     |            |                |      |              |             |             |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-VTKOWH | /runtime_HEAD/ |  100 |  22,207.9 ns |    80.89 ns |    75.66 ns |  22,184.4 ns |  22,029.8 ns |  22,360.8 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-ZDEPQS | /runtime_csel/ |  100 |  22,605.9 ns |    42.00 ns |    32.79 ns |  22,604.3 ns |  22,544.0 ns |  22,663.4 ns |  1.02 |            Same |         - |          NA |
|                     |            |                |      |              |             |             |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-VTKOWH | /runtime_HEAD/ | 1000 | 335,700.4 ns | 2,510.62 ns | 1,960.12 ns | 335,121.6 ns | 335,035.1 ns | 341,918.9 ns |  1.00 |            Base |       3 B |        1.00 |
| Dequeue_And_Enqueue | Job-ZDEPQS | /runtime_csel/ | 1000 | 372,872.5 ns | 2,539.69 ns | 1,982.83 ns | 372,303.9 ns | 372,067.3 ns | 379,155.5 ns |  1.11 |          Slower |       3 B |        1.00 |

|              Method |        Job |      Toolchain | Size |         Mean |       Error |    StdDev |       Median |          Min |          Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|------------:|----------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-RZMVJD | /runtime_HEAD/ |   10 |     912.6 ns |     3.51 ns |   3.11 ns |     913.1 ns |     905.8 ns |     916.5 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-SQCVDU | /runtime_csel/ |   10 |     907.7 ns |     2.35 ns |   2.20 ns |     907.6 ns |     904.1 ns |     911.6 ns |  0.99 |            Same |         - |          NA |
|                     |            |                |      |              |             |           |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-RZMVJD | /runtime_HEAD/ |  100 |  20,857.7 ns |    81.18 ns |  63.38 ns |  20,848.9 ns |  20,749.2 ns |  21,001.9 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-SQCVDU | /runtime_csel/ |  100 |  21,198.8 ns |    27.42 ns |  21.41 ns |  21,201.8 ns |  21,152.0 ns |  21,232.4 ns |  1.02 |            Same |         - |          NA |
|                     |            |                |      |              |             |           |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-RZMVJD | /runtime_HEAD/ | 1000 | 335,674.1 ns |   660.01 ns | 551.14 ns | 335,548.3 ns | 335,317.2 ns | 337,477.0 ns |  1.00 |            Base |       3 B |        1.00 |
| Dequeue_And_Enqueue | Job-SQCVDU | /runtime_csel/ | 1000 | 335,257.5 ns | 1,107.66 ns | 864.79 ns | 334,981.0 ns | 334,812.3 ns | 337,982.3 ns |  1.00 |            Same |       3 B |        1.00 |

|              Method |        Job |      Toolchain | Size |         Mean |     Error |    StdDev |       Median |          Min |          Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|----------:|----------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-OYUIRU | /runtime_HEAD/ |   10 |     921.1 ns |   2.32 ns |   2.05 ns |     921.4 ns |     916.7 ns |     924.2 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-NWPZLS | /runtime_csel/ |   10 |     916.3 ns |   2.15 ns |   1.68 ns |     916.6 ns |     911.6 ns |     917.8 ns |  1.00 |            Same |         - |          NA |
|                     |            |                |      |              |           |           |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-OYUIRU | /runtime_HEAD/ |  100 |  22,568.8 ns |  70.73 ns |  55.22 ns |  22,579.7 ns |  22,462.9 ns |  22,628.4 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-NWPZLS | /runtime_csel/ |  100 |  20,940.2 ns |  47.84 ns |  44.75 ns |  20,927.8 ns |  20,877.8 ns |  21,030.7 ns |  0.93 |          Faster |         - |          NA |
|                     |            |                |      |              |           |           |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-OYUIRU | /runtime_HEAD/ | 1000 | 337,295.6 ns | 204.63 ns | 159.76 ns | 337,336.8 ns | 336,850.6 ns | 337,471.5 ns |  1.00 |            Base |       3 B |        1.00 |
| Dequeue_And_Enqueue | Job-NWPZLS | /runtime_csel/ | 1000 | 337,909.2 ns | 132.52 ns | 110.66 ns | 337,888.3 ns | 337,696.1 ns | 338,103.8 ns |  1.00 |            Same |       3 B |        1.00 |

|              Method |        Job |      Toolchain | Size |         Mean |       Error |      StdDev |       Median |          Min |          Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|------------:|------------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-TXJJRE | /runtime_HEAD/ |   10 |     903.6 ns |     5.24 ns |     4.38 ns |     903.9 ns |     896.1 ns |     910.6 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-YZYMIP | /runtime_csel/ |   10 |     911.3 ns |     4.31 ns |     4.03 ns |     911.5 ns |     902.4 ns |     917.1 ns |  1.01 |            Same |         - |          NA |
|                     |            |                |      |              |             |             |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-TXJJRE | /runtime_HEAD/ |  100 |  22,832.6 ns |   131.61 ns |   109.90 ns |  22,821.6 ns |  22,667.4 ns |  22,990.8 ns |  1.00 |            Base |         - |          NA |
| Dequeue_And_Enqueue | Job-YZYMIP | /runtime_csel/ |  100 |  22,600.3 ns |    64.18 ns |    60.03 ns |  22,602.5 ns |  22,486.4 ns |  22,695.0 ns |  0.99 |            Same |         - |          NA |
|                     |            |                |      |              |             |             |              |              |              |       |                 |           |             |
| Dequeue_And_Enqueue | Job-TXJJRE | /runtime_HEAD/ | 1000 | 373,532.0 ns | 5,154.39 ns | 4,024.21 ns | 372,391.7 ns | 372,194.4 ns | 386,308.0 ns |  1.00 |            Base |       3 B |        1.00 |
| Dequeue_And_Enqueue | Job-YZYMIP | /runtime_csel/ | 1000 | 338,480.7 ns | 3,678.06 ns | 2,871.59 ns | 337,650.6 ns | 337,565.9 ns | 347,596.6 ns |  0.91 |          Faster |       3 B |        1.00 |

a74nh avatar Sep 06 '22 10:09 a74nh

New version allows for side effects in the JTRUE condition.

a74nh avatar Sep 07 '22 10:09 a74nh

These are the results of an spmidiff run on the latest:

[12:10:59] Found 547 files with textual diffs.
[12:10:59]
[12:10:59] Summary of Code Size diffs:
[12:10:59] (Lower is better)
[12:10:59]
[12:10:59] Total bytes of base: 138017124 (overridden on cmd)
[12:10:59] Total bytes of diff: 138016688 (overridden on cmd)
[12:10:59] Total bytes of delta: -436 (-0.00 % of base)
[12:10:59]     diff is an improvement.
[12:10:59]     relative diff is an improvement.
[12:10:59]
[12:10:59]
[12:10:59] Top file regressions (bytes):
[12:10:59]            4 : 163904.dasm (0.115% of base)
[12:10:59]
[12:10:59] Top file improvements (bytes):
[12:10:59]          -52 : 73782.dasm (-3.242% of base)
[12:10:59]          -20 : 217413.dasm (-4.348% of base)
[12:10:59]          -16 : 131576.dasm (-0.524% of base)
[12:10:59]          -16 : 18437.dasm (-8.889% of base)
[12:10:59]          -16 : 37694.dasm (-7.143% of base)
[12:10:59]          -16 : 17846.dasm (-8.333% of base)
[12:10:59]          -12 : 18481.dasm (-1.429% of base)
[12:10:59]          -12 : 47579.dasm (-0.419% of base)
[12:10:59]          -12 : 217390.dasm (-1.382% of base)
[12:10:59]           -8 : 18840.dasm (-2.703% of base)
[12:10:59]           -8 : 36599.dasm (-5.405% of base)
[12:10:59]           -8 : 58863.dasm (-0.357% of base)
[12:10:59]           -8 : 93319.dasm (-0.405% of base)
[12:10:59]           -8 : 191.dasm (-0.366% of base)
[12:10:59]           -8 : 23704.dasm (-2.857% of base)
[12:10:59]           -8 : 28672.dasm (-2.703% of base)
[12:10:59]           -8 : 19336.dasm (-2.703% of base)
[12:10:59]           -8 : 36467.dasm (-5.405% of base)
[12:10:59]           -8 : 17637.dasm (-2.703% of base)
[12:10:59]           -8 : 204836.dasm (-5.405% of base)
[12:10:59]
[12:10:59] 51 total files with Code Size differences (50 improved, 1 regressed), 572 unchanged.
[12:10:59]
[12:10:59] Top method regressions (bytes):
[12:10:59]            4 (0.115% of base) : 163904.dasm - Microsoft.CodeAnalysis.CSharp.MethodCompiler:CompileNamedType(Microsoft.CodeAnalysis.CSharp.Symbols.NamedTypeSymbol):this
[12:10:59]
[12:10:59] Top method improvements (bytes):
[12:10:59]          -52 (-3.242% of base) : 73782.dasm - Microsoft.Build.BackEnd.Logging.EventSourceSink:Consume(Microsoft.Build.Framework.BuildEventArgs):this
[12:10:59]          -20 (-4.348% of base) : 217413.dasm - DevDiv_278372:Test(int,int,int,int,int,int):int
[12:10:59]          -16 (-8.889% of base) : 18437.dasm - System.Collections.Generic.Stack`1[__Canon][System.__Canon]:PushWithResize(System.__Canon):this
[12:10:59]          -16 (-7.143% of base) : 37694.dasm - System.Collections.Generic.Stack`1[AndNode][System.Dynamic.BindingRestrictions+TestBuilder+AndNode]:PushWithResize(AndNode):this
[12:10:59]          -16 (-8.333% of base) : 17846.dasm - System.Collections.Generic.Stack`1[Int32][System.Int32]:PushWithResize(int):this
[12:10:59]          -16 (-0.524% of base) : 131576.dasm - VectorTest:Main():int
[12:10:59]          -12 (-1.429% of base) : 18481.dasm - Internal.IL.InstantiatedMethodIL:GetObject(int,int):System.Object:this
[12:10:59]          -12 (-1.382% of base) : 217390.dasm - System.Linq.Enumerable:MaxInteger(System.Collections.Generic.IEnumerable`1[Int32]):int
[12:10:59]          -12 (-0.419% of base) : 47579.dasm - System.Xml.Schema.ParticleContentValidator:BuildTransitionTable(System.Xml.Schema.BitSet,System.Xml.Schema.BitSet[],int):System.Int32[][]:this
[12:10:59]           -8 (-0.366% of base) : 191.dasm - <ReadBufferAsync>d__16:MoveNext():this
[12:10:59]           -8 (-1.869% of base) : 119952.dasm - GitHub_18497:Main():int
[12:10:59]           -8 (-0.187% of base) : 17523.dasm - Internal.TypeSystem.Ecma.EcmaModule:ResolveTypeReference(System.Reflection.Metadata.TypeReferenceHandle):System.Object:this
[12:10:59]           -8 (-2.857% of base) : 23704.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:AddWithResize(System.__Canon):this
[12:10:59]           -8 (-2.469% of base) : 65713.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:AddWithResize(System.__Canon):this
[12:10:59]           -8 (-5.405% of base) : 36599.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59]           -8 (-5.405% of base) : 204836.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59]           -8 (-2.703% of base) : 17332.dasm - System.Collections.Generic.List`1[Boolean][System.Boolean]:AddWithResize(bool):this
[12:10:59]           -8 (-2.381% of base) : 65658.dasm - System.Collections.Generic.List`1[Char][System.Char]:AddWithResize(ushort):this
[12:10:59]           -8 (-2.703% of base) : 23641.dasm - System.Collections.Generic.List`1[Char][System.Char]:AddWithResize(ushort):this
[12:10:59]           -8 (-5.405% of base) : 36467.dasm - System.Collections.Generic.List`1[Char][System.Char]:Grow(int):this
[12:10:59]
[12:10:59] Top method regressions (percentages):
[12:10:59]            4 (0.115% of base) : 163904.dasm - Microsoft.CodeAnalysis.CSharp.MethodCompiler:CompileNamedType(Microsoft.CodeAnalysis.CSharp.Symbols.NamedTypeSymbol):this
[12:10:59]
[12:10:59] Top method improvements (percentages):
[12:10:59]          -16 (-8.889% of base) : 18437.dasm - System.Collections.Generic.Stack`1[__Canon][System.__Canon]:PushWithResize(System.__Canon):this
[12:10:59]          -16 (-8.333% of base) : 17846.dasm - System.Collections.Generic.Stack`1[Int32][System.Int32]:PushWithResize(int):this
[12:10:59]          -16 (-7.143% of base) : 37694.dasm - System.Collections.Generic.Stack`1[AndNode][System.Dynamic.BindingRestrictions+TestBuilder+AndNode]:PushWithResize(AndNode):this
[12:10:59]           -8 (-5.405% of base) : 36599.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59]           -8 (-5.405% of base) : 204836.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59]           -8 (-5.405% of base) : 36467.dasm - System.Collections.Generic.List`1[Char][System.Char]:Grow(int):this
[12:10:59]           -8 (-5.405% of base) : 204535.dasm - System.Collections.Generic.List`1[Char][System.Char]:Grow(int):this
[12:10:59]           -8 (-4.762% of base) : 28673.dasm - System.Collections.Generic.List`1[Int32][System.Int32]:Grow(int):this
[12:10:59]          -20 (-4.348% of base) : 217413.dasm - DevDiv_278372:Test(int,int,int,int,int,int):int
[12:10:59]          -52 (-3.242% of base) : 73782.dasm - Microsoft.Build.BackEnd.Logging.EventSourceSink:Consume(Microsoft.Build.Framework.BuildEventArgs):this
[12:10:59]           -4 (-2.857% of base) : 18560.dasm - Internal.TypeSystem.MetadataTypeSystemContext:ComputeHasStaticConstructor(Internal.TypeSystem.TypeDesc):bool:this
[12:10:59]           -8 (-2.857% of base) : 23704.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:AddWithResize(System.__Canon):this
[12:10:59]           -8 (-2.740% of base) : 154540.dasm - System.Collections.Generic.List`1[Double][System.Double]:AddWithResize(double):this
[12:10:59]           -8 (-2.703% of base) : 17332.dasm - System.Collections.Generic.List`1[Boolean][System.Boolean]:AddWithResize(bool):this
[12:10:59]           -8 (-2.703% of base) : 23641.dasm - System.Collections.Generic.List`1[Char][System.Char]:AddWithResize(ushort):this
[12:10:59]           -8 (-2.703% of base) : 17637.dasm - System.Collections.Generic.List`1[CorJitFlag][Internal.JitInterface.CorJitFlag]:AddWithResize(int):this
[12:10:59]           -8 (-2.703% of base) : 28672.dasm - System.Collections.Generic.List`1[Int32][System.Int32]:AddWithResize(int):this
[12:10:59]           -8 (-2.703% of base) : 18840.dasm - System.Collections.Generic.List`1[IntPtr][System.IntPtr]:AddWithResize(long):this
[12:10:59]           -8 (-2.703% of base) : 19336.dasm - System.Collections.Generic.List`1[UInt16][System.UInt16]:AddWithResize(ushort):this
[12:10:59]           -8 (-2.703% of base) : 8561.dasm - System.Collections.Generic.List`1[Vector`1][System.Numerics.Vector`1[System.Int32]]:AddWithResize(System.Numerics.Vector`1[Int32]):this
[12:10:59]
[12:10:59] 51 total methods with Code Size differences (50 improved, 1 regressed), 572 unchanged.
[12:10:59]

a74nh avatar Sep 20 '22 09:09 a74nh

Looking through the remaining outstanding comments:

I fixed up the valuenum.cpp comment.

Note we may see PHIs in the conditional block (even if there's no join) as leftovers from a prior flow optimization.

As the code stands, it would reject any block with PHIs in them. I think that's the behaviour we want.

a74nh avatar Sep 21 '22 12:09 a74nh

Does this PR handle int x = cond ? 1 : 0; ?

EgorBo avatar Oct 05 '22 01:10 EgorBo

Does this PR handle int x = cond ? 1 : 0; ?

No, because that has got the else condition of 0. Plan was to add that in a later patch.

Thinking about it, we'd want to have a specific check so that the true path and false path can both get caught by a single SELECT node.

Which is different entirely to an independent else case: if cond {x=1;} else {y=1;}

a74nh avatar Oct 05 '22 14:10 a74nh

Added two Stress options: STRESS_IF_CONVERSION_INNER_LOOPS to allow inner loops and STRESS_IF_CONVERSION_COST to allow all costs.

a74nh avatar Oct 07 '22 10:10 a74nh

@a74nh I see some diffs that look like the following:

-G_M30669_IG05:        ; gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, isz
+G_M30669_IG05:        ; gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
             mov     w0, wzr
             tst     w1, dotnet/perf-autofiling-issues#12
-            beq     G_M30669_IG06
-            mov     w0, dotnet/perf-autofiling-issues#12
-                                               ;; size=16 bbWeight=0.50 PerfScore 1.25
-G_M30669_IG06:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+            cset    x2, ne
+            mov     w3, dotnet/perf-autofiling-issues#12
+            cmp     w2, #0
+            csel    w0, w3, w0, ne

Seems like we are missing some opportunity here to fold a relop. At the very least it could be:

             mov     w0, wzr
             tst     w1, dotnet/perf-autofiling-issues#12
-            beq     G_M30669_IG06
-            mov     w0, dotnet/perf-autofiling-issues#12
+            mov     w2, dotnet/perf-autofiling-issues#12
+            csel    w0, w0, w2, eq

right?

This example is from System.Net.Security.SslStream:GetSslProtocolInternal():int:this in the benchmarks collection. Of course there is some opportunity to fold the wzr too, but I guess you are leaving that up to a future PR.

I know I said I wasn't that worried about the size wise regressions, but the current diffs do look somewhat scary, so if there are some easy cases it would be good to resolve them.

Another case from IfStatements.IfStatements:SingleInner(int) (I guess this one was added by you):

-G_M9208_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
-            tbnz    w0, #0, G_M9208_IG04
-                                               ;; size=4 bbWeight=1    PerfScore 1.00
-G_M9208_IG03:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-            mov     w0, dotnet/perf-autofiling-issues#5
-                                               ;; size=4 bbWeight=0.50 PerfScore 0.25
-G_M9208_IG04:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+G_M9208_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+            tst     w0, #1
+            cset    x1, eq
+            mov     w2, dotnet/perf-autofiling-issues#5
+            cmp     w1, #0
+            csel    w0, w2, w0, ne

This looks odd too. We should maybe skip the transformation for the cbz/cbnz/tbz/tbnz cases for now.

It would be good to run some more of these micro benchmarks with your change and check how they are affected.

jakobbotsch avatar Oct 10 '22 11:10 jakobbotsch

Another case from IfStatements.IfStatements:SingleInner(int) (I guess this one was added by you):

Ok, that's due to the JTRUE having an AND statement. Without my patch the AND gets folded away.

N006 (  7,  9) [000005] -----------                         *  JTRUE     void   $VN.Void
N005 (  5,  7) [000004] J------N---                         \--*  NE        int    $101
N003 (  3,  4) [000014] -----------                            +--*  AND       int    $100
N001 (  1,  1) [000000] -----------                            |  +--*  LCL_VAR   int    V00 arg0         u:1 $80
N002 (  1,  2) [000013] -----------                            |  \--*  CNS_INT   int    1 $41
N004 (  1,  2) [000003] -----------                            \--*  CNS_INT   int    0 $40

I've added a small change to avoid any compares against 0. That should get rid of most the regressions. However, it's not going to catch the cases where the comparison is later optimized into a compare against zero

( On a side note, I've been having a quick look at adding else cases. z = (a==0) ? c : d should have no regressions. )

Seems like we are missing some opportunity here to fold a relop

Still need to check this one.

a74nh avatar Oct 10 '22 13:10 a74nh

Seems like we are missing some opportunity here to fold a relop

Still need to check this one.

Ahh, this is essentially the same case as above

a74nh avatar Oct 10 '22 13:10 a74nh

Ahh, this is essentially the same case as above

Just to confirm, but we should be able to handle that with csel, right? I assume we are missing some opportunity maybe downstream or in if-conversion itself.

The regressions still look to be some of the 'bad' cases related to bit tests and unnecessary additional relops in the converted variant:

-            tbz     w1, #1, G_M20791_IG06
-                                               ;; size=8 bbWeight=1    PerfScore 1.50
-G_M20791_IG05:        ; gcrefRegs=3E80000 {x19 x21 x22 x23 x24 x25}, byrefRegs=0000 {}, byref
-            mov     w26, dotnet/perf-autofiling-issues#5
-                                               ;; size=4 bbWeight=0.50 PerfScore 0.25
-G_M20791_IG06:        ; gcrefRegs=3E80000 {x19 x21 x22 x23 x24 x25}, byrefRegs=0000 {}, byref, isz
+            tst     w1, dotnet/runtime#56343
+            cset    x0, ne
+            mov     w2, dotnet/perf-autofiling-issues#5
+            cmp     w0, #0
+            csel    w26, w2, w26, ne

(System.Reflection.Emit.SignatureHelper:GetMethodSigHelper in benchmarks collection)

jakobbotsch avatar Oct 11 '22 11:10 jakobbotsch

This diff in LUDecomp:lubksb(double[][],int,int[],double[]) from benchmarks looks odd as well:

 G_M64421_IG16:        ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref, isz
             fcmp    d16, #0.0
-            beq     G_M64421_IG17
-            mov     w4, w5
-                                               ;; size=12 bbWeight=0.02 PerfScore 0.07
-G_M64421_IG17:        ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref, isz
+            cset    x6, gt
+            ble     G_M64421_IG17
+            cset    x6, lo
+                                               ;; size=16 bbWeight=0.02 PerfScore 0.08
+G_M64421_IG17:        ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref
+            cmp     w6, #0
+            csel    w4, w5, w4, ne
+                                               ;; size=8 bbWeight=0.02 PerfScore 0.02
+G_M64421_IG18:        ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref, isz

May be good to extract some (float) relops into some tests to get some specific coverage of the if conversion pass.

jakobbotsch avatar Oct 11 '22 12:10 jakobbotsch