Arm64: Add If conversion pass
This patch builds upon the work in https://github.com/dotnet/runtime/pull/71705
A single condition can be optimised by replacing a JTRUE with a SELECT and removing the control flow.
For example:
if (op1 > 0) { op1 = 5; }
IN0001: 000008 mov w2, #5
IN0002: 00000C cmp w0, #0
IN0003: 000010 csel w0, w0, w2, eq
Additional JTRUE statements can be replaced by placing AND chains onto the conditional part of a SELECT.
For example:
if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
IN0001: 000008 cmp w2, #7
IN0002: 00000C ccmp w1, #10, z, lt
IN0003: 000010 ccmp w0, #3, nzc, ne
IN0004: 000014 mov w2, #5
IN0005: 000018 csel w0, w2, w0, gt
Performance tests are here: https://github.com/dotnet/performance/pull/2517
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Issue Details
This patch builds upon the work in https://github.com/dotnet/runtime/pull/71705
A single condition can be optimised by replacing a JTRUE with a SELECT and removing the control flow.
For example:
if (op1 > 0) { op1 = 5; }
IN0001: 000008 mov w2, dotnet/perf-autofiling-issues#5
IN0002: 00000C cmp w0, #0
IN0003: 000010 csel w0, w0, w2, eq
Additional JTRUE statements can be replaced by placing AND chains onto the conditional part of a SELECT.
For example:
if (op1 > 3 && op2 != 10 && op3 < 7) { op1 = 5; }
IN0001: 000008 cmp w2, dotnet/runtime#56402
IN0002: 00000C ccmp w1, dotnet/perf-autofiling-issues#10, z, lt
IN0003: 000010 ccmp w0, dotnet/perf-autofiling-issues#3, nzc, ne
IN0004: 000014 mov w2, dotnet/perf-autofiling-issues#5
IN0005: 000018 csel w0, w2, w0, gt
Performance tests are here: https://github.com/dotnet/performance/pull/2517
| Author: | a74nh |
|---|---|
| Assignees: | - |
| Labels: |
|
| Milestone: | - |
The first commit in this PR is a copy of https://github.com/dotnet/runtime/pull/71705. Once that PR is merged I will remove that commit from this patch. This was just so I could get this out a little earlier.
@kunalspathak @jakobbotsch
Also, this probably needs an additional check in the optimizer pass to limit the max size of the chains.
PR is now rebased on top of HEAD, and is good for a review. Taking this out of draft.
I do have the one failure being hit a few times in the testsuite - debugging this now. However, I'm not expecting that to make big changes to the patch.
I'm away for two weeks from Monday. Getting this in before then feels unlikely. However, it'd be good to know how close it is.
@kunalspathak @jakobbotsch
The diffs look like size wise regressions, why is that? I would expect this pattern should give smaller code in most cases.
For all the cases where the code is longer that will be due to no longer using cbz/cbnz instructions:
cbnz x0, LABEL1
mov x1, 5
LABEL1:
now becomes:
cmp x0, 0
mov x2, 5
csel x1, x2, x1, eq
Which is one instruction longer. Assuming the branch taken/not taken is fairly random, then the new code will be faster. I should probably extend https://github.com/dotnet/performance/pull/2517 to make sure those cases are being tested.
Assuming the branch taken/not taken is fairly random, then the new code will be faster.
I'm not sure this is a fair assumption in practice. Can you post some micro benchmarks for all cases (both randomly taken branches and biased branches)?
/azp run Fuzzlyn
Azure Pipelines successfully started running 1 pipeline(s).
As a side point - when rebasing is it best to pull/rebase/forcepush or is a merge commit preferred?
As a side point - when rebasing is it best to pull/rebase/forcepush or is a merge commit preferred?
Merge commits are preferred (essentially, it is easiest for reviewers if you never force push a non-draft PR).
Assuming the branch taken/not taken is fairly random, then the new code will be faster.
I'm not sure this is a fair assumption in practice. Can you post some micro benchmarks for all cases (both randomly taken branches and biased branches)?
I expanded the test Single() in the If Statement benchmarks that I previously added to the performance repo.
https://github.com/dotnet/performance/blob/5f63e8dcf81a0f9ddf5a80bd791d1fbaf0babd61/src/benchmarks/micro/runtime/Statements/If.cs#L46
[Benchmark]
public void Single() {
for (int i = 0; i < Iterations; i++) {
SingleInner(inputs[i]);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static void SingleInner(int op1) {
if (op1 % 2 == 0) {
op1 = 5;
}
Consume(op1, 0, 0, 0);
}
(I'll post a link to the patch later)
For this first set: Input is random data. Mod 2 in the first test gives roughly 50/50 split. Each following test increases the mod, giving more weight to jumping over the condition. Eg: Single(): if (op1 % 2 == 0) { op1 = 5; } Single3(): if (op1 % 3 == 0) { op1 = 5; } Single4(): if (op1 % 4 == 0) { op1 = 5; } etc
Performance for the ~50/50 is 46% faster, and then drops off as the taken/not taken moves away from 50/50. But even with mod6 it's still 24% faster.
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) |
|--------------- |----------- |----------------------------------------------------------------------------------------------------------- |
| Single | Job-BEWOVY | /runtime_HEAD/ | 62.27 us | 0.034 us | 0.028 us | 62.28 us | 62.22 us | 62.31 us | 1.00 | Base |
| Single | Job-EDSOJO | /runtime_csel/ | 33.65 us | 0.110 us | 0.097 us | 33.65 us | 33.48 us | 33.82 us | 0.54 | Faster |
| | | |
| Single3 | Job-BEWOVY | /runtime_HEAD/ | 65.84 us | 0.069 us | 0.061 us | 65.82 us | 65.77 us | 65.96 us | 1.00 | Base |
| Single3 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.032 us | 0.026 us | 39.41 us | 39.39 us | 39.48 us | 0.60 | Faster |
| | | |
| Single4 | Job-BEWOVY | /runtime_HEAD/ | 52.70 us | 0.011 us | 0.009 us | 52.70 us | 52.69 us | 52.72 us | 1.00 | Base |
| Single4 | Job-EDSOJO | /runtime_csel/ | 33.69 us | 0.168 us | 0.140 us | 33.75 us | 33.40 us | 33.84 us | 0.64 | Faster |
| | | |
| Single5 | Job-BEWOVY | /runtime_HEAD/ | 56.68 us | 0.057 us | 0.050 us | 56.70 us | 56.58 us | 56.72 us | 1.00 | Base |
| Single5 | Job-EDSOJO | /runtime_csel/ | 39.75 us | 0.014 us | 0.013 us | 39.75 us | 39.73 us | 39.77 us | 0.70 | Faster |
| | | |
| Single6 | Job-BEWOVY | /runtime_HEAD/ | 51.56 us | 0.025 us | 0.022 us | 51.57 us | 51.50 us | 51.58 us | 1.00 | Base |
| Single6 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.027 us | 0.023 us | 39.41 us | 39.37 us | 39.45 us | 0.76 | Faster |
For this set, I reverse the functions, so that the jump is increasingly likely to happen. Eg: Single(): if (op1 % 2 != 0) { op1 = 5; } SingleRev3(): if (op1 % 3 != 0) { op1 = 5; } etc As expected, results are the same as above.
| | | |
| SingleRev | Job-BEWOVY | /runtime_HEAD/ | 62.70 us | 0.048 us | 0.040 us | 62.69 us | 62.65 us | 62.77 us | 1.00 | Base |
| SingleRev | Job-EDSOJO | /runtime_csel/ | 33.62 us | 0.039 us | 0.032 us | 33.64 us | 33.58 us | 33.67 us | 0.54 | Faster |
| | | |
| SingleRev3 | Job-BEWOVY | /runtime_HEAD/ | 64.90 us | 0.023 us | 0.019 us | 64.90 us | 64.86 us | 64.93 us | 1.00 | Base |
| SingleRev3 | Job-EDSOJO | /runtime_csel/ | 39.38 us | 0.028 us | 0.023 us | 39.38 us | 39.34 us | 39.42 us | 0.61 | Faster |
| | | |
| SingleRev4 | Job-BEWOVY | /runtime_HEAD/ | 51.89 us | 0.192 us | 0.171 us | 51.80 us | 51.75 us | 52.18 us | 1.00 | Base |
| SingleRev4 | Job-EDSOJO | /runtime_csel/ | 33.64 us | 0.195 us | 0.173 us | 33.74 us | 33.40 us | 33.82 us | 0.65 | Faster |
| | | |
| SingleRev5 | Job-BEWOVY | /runtime_HEAD/ | 57.31 us | 0.020 us | 0.018 us | 57.31 us | 57.28 us | 57.34 us | 1.00 | Base |
| SingleRev5 | Job-EDSOJO | /runtime_csel/ | 40.00 us | 0.530 us | 0.442 us | 39.75 us | 39.68 us | 40.65 us | 0.70 | Faster |
| | | |
| SingleRev6 | Job-BEWOVY | /runtime_HEAD/ | 51.72 us | 0.027 us | 0.024 us | 51.71 us | 51.69 us | 51.76 us | 1.00 | Base |
| SingleRev6 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.025 us | 0.021 us | 39.42 us | 39.39 us | 39.46 us | 0.76 | Faster |
For the last two sets, I use exactly the same functions as above, but the input data is sequential instead of random. SingleSeq will alternate exactly 50/50: pass, fail, pass, fail... SingleSeq3 will pass, fail, fail, pass, fail, fail... SingleSeq4 will pass, fail, fail, fail, pass, fail, fail fail... etc
Here we see performance gets worse, but only slightly - worse case is 6% slower. Oddly, one of the cases is faster.
| SingleSeq | Job-BEWOVY | /runtime_HEAD/ | 33.42 us | 0.066 us | 0.055 us | 33.41 us | 33.37 us | 33.56 us | 1.00 | Base |
| SingleSeq | Job-EDSOJO | /runtime_csel/ | 33.62 us | 0.116 us | 0.103 us | 33.67 us | 33.39 us | 33.71 us | 1.01 | Same |
| | | |
| SingleSeq3 | Job-BEWOVY | /runtime_HEAD/ | 37.51 us | 0.020 us | 0.018 us | 37.51 us | 37.48 us | 37.54 us | 1.00 | Base |
| SingleSeq3 | Job-EDSOJO | /runtime_csel/ | 39.42 us | 0.015 us | 0.014 us | 39.42 us | 39.40 us | 39.45 us | 1.05 | Slower |
| | | |
| SingleSeq4 | Job-BEWOVY | /runtime_HEAD/ | 35.04 us | 0.004 us | 0.003 us | 35.04 us | 35.03 us | 35.04 us | 1.00 | Base |
| SingleSeq4 | Job-EDSOJO | /runtime_csel/ | 33.65 us | 0.124 us | 0.110 us | 33.69 us | 33.38 us | 33.73 us | 0.96 | Faster |
| | | |
| SingleSeq5 | Job-BEWOVY | /runtime_HEAD/ | 38.71 us | 0.258 us | 0.229 us | 38.88 us | 38.41 us | 38.93 us | 1.00 | Base |
| SingleSeq5 | Job-EDSOJO | /runtime_csel/ | 39.68 us | 0.021 us | 0.018 us | 39.69 us | 39.65 us | 39.71 us | 1.03 | Same |
| | | |
| SingleSeq6 | Job-BEWOVY | /runtime_HEAD/ | 37.26 us | 0.017 us | 0.013 us | 37.25 us | 37.24 us | 37.29 us | 1.00 | Base |
| SingleSeq6 | Job-EDSOJO | /runtime_csel/ | 39.43 us | 0.020 us | 0.017 us | 39.43 us | 39.40 us | 39.47 us | 1.06 | Slower |
| SingleSeqRev | Job-BEWOVY | /runtime_HEAD/ | 33.51 us | 0.204 us | 0.170 us | 33.42 us | 33.37 us | 33.82 us | 1.00 | Base |
| SingleSeqRev | Job-EDSOJO | /runtime_csel/ | 33.58 us | 0.124 us | 0.110 us | 33.63 us | 33.38 us | 33.67 us | 1.00 | Same |
| | | |
| SingleSeqRev3 | Job-BEWOVY | /runtime_HEAD/ | 38.10 us | 0.031 us | 0.026 us | 38.10 us | 38.04 us | 38.14 us | 1.00 | Base |
| SingleSeqRev3 | Job-EDSOJO | /runtime_csel/ | 39.34 us | 0.024 us | 0.021 us | 39.34 us | 39.31 us | 39.38 us | 1.03 | Slower |
| | | |
| SingleSeqRev4 | Job-BEWOVY | /runtime_HEAD/ | 33.36 us | 0.003 us | 0.002 us | 33.36 us | 33.36 us | 33.37 us | 1.00 | Base |
| SingleSeqRev4 | Job-EDSOJO | /runtime_csel/ | 33.60 us | 0.170 us | 0.151 us | 33.68 us | 33.37 us | 33.73 us | 1.01 | Same |
| | | |
| SingleSeqRev5 | Job-BEWOVY | /runtime_HEAD/ | 39.04 us | 0.053 us | 0.050 us | 39.06 us | 38.93 us | 39.09 us | 1.00 | Base |
| SingleSeqRev5 | Job-EDSOJO | /runtime_csel/ | 39.68 us | 0.017 us | 0.015 us | 39.68 us | 39.65 us | 39.71 us | 1.02 | Same |
| | | |
| SingleSeqRev6 | Job-BEWOVY | /runtime_HEAD/ | 37.91 us | 0.030 us | 0.026 us | 37.91 us | 37.87 us | 37.96 us | 1.00 | Base |
| SingleSeqRev6 | Job-EDSOJO | /runtime_csel/ | 39.37 us | 0.029 us | 0.026 us | 39.37 us | 39.33 us | 39.43 us | 1.04 | Slower |
These are all on an Altra. With older Arm hardware the performance gains get smaller as the hardware gets older.
What does it look like for always taken/always not taken branches?
How does it affect Queue<T>.Enqueue/Queue<T>.Dequeue in a loop? It uses this function:
https://github.com/dotnet/runtime/blob/8fe12bc5a66a5ef42200a3d88d68b5991eb158e6/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Queue.cs#L344-L356
Does the change get this case?
And thanks for posting the benchmarks 🙂
What does it look like for always taken/always not taken branches?
Not so great.... For both of these it's mod2 with inputs of all zeros. SingleSeqAlways - Always passes the condition. 7% slower. SingleSeqAlwaysNever - Never passes the condition, 12% slower
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) |
|--------------------- |----------- |----------------------------------------------------------------------------------------------------------- |
| SingleSeqAlways | Job-QIHMDO | /runtime_HEAD/ | 31.38 us | 0.063 us | 0.056 us | 31.38 us | 31.30 us | 31.48 us | 1.00 | Base |
| SingleSeqAlways | Job-SMJRRR | /runtime_csel/ | 33.60 us | 0.037 us | 0.031 us | 33.61 us | 33.55 us | 33.66 us | 1.07 | Slower |
| | | |
| SingleSeqAlwaysNever | Job-QIHMDO | /runtime_HEAD/ | 30.10 us | 0.060 us | 0.047 us | 30.11 us | 30.05 us | 30.18 us | 1.00 | Base |
| SingleSeqAlwaysNever | Job-SMJRRR | /runtime_csel/ | 33.58 us | 0.127 us | 0.106 us | 33.61 us | 33.40 us | 33.71 us | 1.12 | Slower |
How does it affect
Queue<T>.Enqueue/Queue<T>.Dequeuein a loop? It uses this function:https://github.com/dotnet/runtime/blob/8fe12bc5a66a5ef42200a3d88d68b5991eb158e6/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Queue.cs#L344-L356
Does the change get this case? And thanks for posting the benchmarks 🙂
I extracted this into a test case and it fails to If Convert:
***** BB01
STMT00001 ( 0x005[E-] ... 0x00E )
N008 ( 10, 9) [000011] ---XG------ * JTRUE void <l:$1ca, c:$1cb>
N007 ( 8, 7) [000010] N--XG--N-U- \--* NE int <l:$20b, c:$20c>
N005 ( 6, 5) [000009] ---XG------ +--* ARR_LENGTH int <l:$207, c:$208>
N004 ( 4, 3) [000008] ---XG------ | \--* IND ref <l:$1c4, c:$1c5>
N003 ( 3, 4) [000021] -------N--- | \--* ADD byref $280
N001 ( 1, 1) [000007] ----------- | +--* LCL_VAR ref V00 this u:1 (last use) $80
N002 ( 1, 2) [000020] ----------- | \--* CNS_INT long 8 Fseq[_array] $240
N006 ( 1, 1) [000006] ----------- \--* LCL_VAR int V02 loc0 u:2 <l:$202, c:$203>
The NE fails on a flags GTF_SIDE_EFFECT not set check - as both GTF_EXCEPT and GTF_GLOB_REF are set. As I understood it, it wouldn't be safe to move that? It's a shame if it can't as that will stop all member data being used.
As I understood it, it wouldn't be safe to move that?
It would be fine to do the transformation regardless of side effects on the condition. It's what I pointed out in https://github.com/dotnet/runtime/pull/73472#discussion_r942474901. (Edit: I suppose this is true only after the and-chains have been removed.)
Not so great.... For both of these it's mod2 with inputs of all zeros. SingleSeqAlways - Always passes the condition. 7% slower. SingleSeqAlwaysNever - Never passes the condition, 12% slower
It's a bit worrying indeed. We probably need some heuristics and perhaps see if we can use PGO data, though at this late stage in the JIT the PGO data might not be all that reliable.
As I understood it, it wouldn't be safe to move that?
It would be fine to do the transformation regardless of side effects on the condition. It's what I pointed out in #73472 (comment). (Edit: I suppose this is true only after the and-chains have been removed.)
Ahh, I missed that part.
I suppose this is true only after the and-chains have been removed
Are you suggesting that this check can't be removed until the CCMP patch has also been done?
Regardless, I tried this out....
Creating a quick benchmark:
[Benchmark]
public void MoveNext() {
int index = 0;
for (int i = 0; i < Iterations; i++) {
MoveNextInner(ref index);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private void MoveNextInner(ref int index)
{
// It is tempting to use the remainder operator here but it is actually much slower
// than a simple comparison and a rarely taken branch.
// JIT produces better code than with ternary operator ?:
int tmp = index + 1;
if (tmp == _array.Length)
{
tmp = 0;
}
index = tmp;
}
With 10,000 itterations and array size of 5, gives us 12% speedup:
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|--------- |----------- |----------------|---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| MoveNext | Job-OVNAZN | /runtime_HEAD/ | 26.29 us | 0.001 us | 0.001 us | 26.29 us | 26.28 us | 26.29 us | 1.00 | Base | - | NA |
| MoveNext | Job-STXUQS | /runtime_csel/ | 23.03 us | 0.001 us | 0.001 us | 23.03 us | 23.03 us | 23.03 us | 0.88 | Faster | - | NA |
I increased array size to 300, and it still does better than I expected, at 9% better:
| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|--------- |----------- |--------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| MoveNext | Job-LENSWQ | /runtime_HEAD/ | 26.29 us | 0.002 us | 0.002 us | 26.29 us | 26.28 us | 26.29 us | 1.00 | Base | - | NA |
| MoveNext | Job-CZJSLJ | /runtime_csel/ | 23.96 us | 0.006 us | 0.005 us | 23.96 us | 23.95 us | 23.97 us | 0.91 | Faster | - | NA |
As I understood it, it wouldn't be safe to move that?
It would be fine to do the transformation regardless of side effects on the condition. It's what I pointed out in #73472 (comment). (Edit: I suppose this is true only after the and-chains have been removed.)
Ahh, I missed that part.
I suppose this is true only after the and-chains have been removed
Are you suggesting that this check can't be removed until the CCMP patch has also been done?
Actually, I think it's safe for this PR. The transformation is only ever squashing together a JTRUE and an assignment directly connected to it. It doesn't matter if doing the check causes a side effect, because the condition always had to be checked.
Once we get to AND chains then the AND chain can't have side conditions inside the chain (due to always having to evaluate everything in the chain).
Are you suggesting that this check can't be removed until the CCMP patch has also been done?
No, it should be fine now. I was not sure if this check used to be guarding just the if-conversion or also the conversion of short-circuiting AND to bitwise AND.
The new benchmarks look good. I am also a little surprised that this is beneficial even at 300 items. What is the diff of the code?
(Edit: I see now this is a micro benchmark of just this code. Can you show the diff for Queue<int>.Enqueue/Dequeue where this function is going to be inlined into?)
What is the diff of the code?
This is my full set of benchmark changes: https://github.com/a74nh/performance/commit/b1c00bafdcb19edcc12640b8e3e2f16752277e80 (Delay sharing was just internal process on my end that we need to go through before sharing)
Queue
.Enqueue/Dequeue
Perf_PriorityQueue already has some tests for this. However my results aren't stable - I'm getting some large errors in the mix. Here's a few different runs:
| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|------------:|------------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-VTKOWH | /runtime_HEAD/ | 10 | 916.2 ns | 4.66 ns | 4.35 ns | 915.6 ns | 909.2 ns | 923.6 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-ZDEPQS | /runtime_csel/ | 10 | 910.6 ns | 4.65 ns | 4.35 ns | 910.7 ns | 902.2 ns | 917.9 ns | 0.99 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-VTKOWH | /runtime_HEAD/ | 100 | 22,207.9 ns | 80.89 ns | 75.66 ns | 22,184.4 ns | 22,029.8 ns | 22,360.8 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-ZDEPQS | /runtime_csel/ | 100 | 22,605.9 ns | 42.00 ns | 32.79 ns | 22,604.3 ns | 22,544.0 ns | 22,663.4 ns | 1.02 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-VTKOWH | /runtime_HEAD/ | 1000 | 335,700.4 ns | 2,510.62 ns | 1,960.12 ns | 335,121.6 ns | 335,035.1 ns | 341,918.9 ns | 1.00 | Base | 3 B | 1.00 |
| Dequeue_And_Enqueue | Job-ZDEPQS | /runtime_csel/ | 1000 | 372,872.5 ns | 2,539.69 ns | 1,982.83 ns | 372,303.9 ns | 372,067.3 ns | 379,155.5 ns | 1.11 | Slower | 3 B | 1.00 |
| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|------------:|----------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-RZMVJD | /runtime_HEAD/ | 10 | 912.6 ns | 3.51 ns | 3.11 ns | 913.1 ns | 905.8 ns | 916.5 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-SQCVDU | /runtime_csel/ | 10 | 907.7 ns | 2.35 ns | 2.20 ns | 907.6 ns | 904.1 ns | 911.6 ns | 0.99 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-RZMVJD | /runtime_HEAD/ | 100 | 20,857.7 ns | 81.18 ns | 63.38 ns | 20,848.9 ns | 20,749.2 ns | 21,001.9 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-SQCVDU | /runtime_csel/ | 100 | 21,198.8 ns | 27.42 ns | 21.41 ns | 21,201.8 ns | 21,152.0 ns | 21,232.4 ns | 1.02 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-RZMVJD | /runtime_HEAD/ | 1000 | 335,674.1 ns | 660.01 ns | 551.14 ns | 335,548.3 ns | 335,317.2 ns | 337,477.0 ns | 1.00 | Base | 3 B | 1.00 |
| Dequeue_And_Enqueue | Job-SQCVDU | /runtime_csel/ | 1000 | 335,257.5 ns | 1,107.66 ns | 864.79 ns | 334,981.0 ns | 334,812.3 ns | 337,982.3 ns | 1.00 | Same | 3 B | 1.00 |
| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|----------:|----------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-OYUIRU | /runtime_HEAD/ | 10 | 921.1 ns | 2.32 ns | 2.05 ns | 921.4 ns | 916.7 ns | 924.2 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-NWPZLS | /runtime_csel/ | 10 | 916.3 ns | 2.15 ns | 1.68 ns | 916.6 ns | 911.6 ns | 917.8 ns | 1.00 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-OYUIRU | /runtime_HEAD/ | 100 | 22,568.8 ns | 70.73 ns | 55.22 ns | 22,579.7 ns | 22,462.9 ns | 22,628.4 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-NWPZLS | /runtime_csel/ | 100 | 20,940.2 ns | 47.84 ns | 44.75 ns | 20,927.8 ns | 20,877.8 ns | 21,030.7 ns | 0.93 | Faster | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-OYUIRU | /runtime_HEAD/ | 1000 | 337,295.6 ns | 204.63 ns | 159.76 ns | 337,336.8 ns | 336,850.6 ns | 337,471.5 ns | 1.00 | Base | 3 B | 1.00 |
| Dequeue_And_Enqueue | Job-NWPZLS | /runtime_csel/ | 1000 | 337,909.2 ns | 132.52 ns | 110.66 ns | 337,888.3 ns | 337,696.1 ns | 338,103.8 ns | 1.00 | Same | 3 B | 1.00 |
| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|-------------------- |----------- |--------------- |----- |-------------:|------------:|------------:|-------------:|-------------:|-------------:|------:|---------------- |----------:|------------:|
| Dequeue_And_Enqueue | Job-TXJJRE | /runtime_HEAD/ | 10 | 903.6 ns | 5.24 ns | 4.38 ns | 903.9 ns | 896.1 ns | 910.6 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-YZYMIP | /runtime_csel/ | 10 | 911.3 ns | 4.31 ns | 4.03 ns | 911.5 ns | 902.4 ns | 917.1 ns | 1.01 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-TXJJRE | /runtime_HEAD/ | 100 | 22,832.6 ns | 131.61 ns | 109.90 ns | 22,821.6 ns | 22,667.4 ns | 22,990.8 ns | 1.00 | Base | - | NA |
| Dequeue_And_Enqueue | Job-YZYMIP | /runtime_csel/ | 100 | 22,600.3 ns | 64.18 ns | 60.03 ns | 22,602.5 ns | 22,486.4 ns | 22,695.0 ns | 0.99 | Same | - | NA |
| | | | | | | | | | | | | | |
| Dequeue_And_Enqueue | Job-TXJJRE | /runtime_HEAD/ | 1000 | 373,532.0 ns | 5,154.39 ns | 4,024.21 ns | 372,391.7 ns | 372,194.4 ns | 386,308.0 ns | 1.00 | Base | 3 B | 1.00 |
| Dequeue_And_Enqueue | Job-YZYMIP | /runtime_csel/ | 1000 | 338,480.7 ns | 3,678.06 ns | 2,871.59 ns | 337,650.6 ns | 337,565.9 ns | 347,596.6 ns | 0.91 | Faster | 3 B | 1.00 |
New version allows for side effects in the JTRUE condition.
These are the results of an spmidiff run on the latest:
[12:10:59] Found 547 files with textual diffs.
[12:10:59]
[12:10:59] Summary of Code Size diffs:
[12:10:59] (Lower is better)
[12:10:59]
[12:10:59] Total bytes of base: 138017124 (overridden on cmd)
[12:10:59] Total bytes of diff: 138016688 (overridden on cmd)
[12:10:59] Total bytes of delta: -436 (-0.00 % of base)
[12:10:59] diff is an improvement.
[12:10:59] relative diff is an improvement.
[12:10:59]
[12:10:59]
[12:10:59] Top file regressions (bytes):
[12:10:59] 4 : 163904.dasm (0.115% of base)
[12:10:59]
[12:10:59] Top file improvements (bytes):
[12:10:59] -52 : 73782.dasm (-3.242% of base)
[12:10:59] -20 : 217413.dasm (-4.348% of base)
[12:10:59] -16 : 131576.dasm (-0.524% of base)
[12:10:59] -16 : 18437.dasm (-8.889% of base)
[12:10:59] -16 : 37694.dasm (-7.143% of base)
[12:10:59] -16 : 17846.dasm (-8.333% of base)
[12:10:59] -12 : 18481.dasm (-1.429% of base)
[12:10:59] -12 : 47579.dasm (-0.419% of base)
[12:10:59] -12 : 217390.dasm (-1.382% of base)
[12:10:59] -8 : 18840.dasm (-2.703% of base)
[12:10:59] -8 : 36599.dasm (-5.405% of base)
[12:10:59] -8 : 58863.dasm (-0.357% of base)
[12:10:59] -8 : 93319.dasm (-0.405% of base)
[12:10:59] -8 : 191.dasm (-0.366% of base)
[12:10:59] -8 : 23704.dasm (-2.857% of base)
[12:10:59] -8 : 28672.dasm (-2.703% of base)
[12:10:59] -8 : 19336.dasm (-2.703% of base)
[12:10:59] -8 : 36467.dasm (-5.405% of base)
[12:10:59] -8 : 17637.dasm (-2.703% of base)
[12:10:59] -8 : 204836.dasm (-5.405% of base)
[12:10:59]
[12:10:59] 51 total files with Code Size differences (50 improved, 1 regressed), 572 unchanged.
[12:10:59]
[12:10:59] Top method regressions (bytes):
[12:10:59] 4 (0.115% of base) : 163904.dasm - Microsoft.CodeAnalysis.CSharp.MethodCompiler:CompileNamedType(Microsoft.CodeAnalysis.CSharp.Symbols.NamedTypeSymbol):this
[12:10:59]
[12:10:59] Top method improvements (bytes):
[12:10:59] -52 (-3.242% of base) : 73782.dasm - Microsoft.Build.BackEnd.Logging.EventSourceSink:Consume(Microsoft.Build.Framework.BuildEventArgs):this
[12:10:59] -20 (-4.348% of base) : 217413.dasm - DevDiv_278372:Test(int,int,int,int,int,int):int
[12:10:59] -16 (-8.889% of base) : 18437.dasm - System.Collections.Generic.Stack`1[__Canon][System.__Canon]:PushWithResize(System.__Canon):this
[12:10:59] -16 (-7.143% of base) : 37694.dasm - System.Collections.Generic.Stack`1[AndNode][System.Dynamic.BindingRestrictions+TestBuilder+AndNode]:PushWithResize(AndNode):this
[12:10:59] -16 (-8.333% of base) : 17846.dasm - System.Collections.Generic.Stack`1[Int32][System.Int32]:PushWithResize(int):this
[12:10:59] -16 (-0.524% of base) : 131576.dasm - VectorTest:Main():int
[12:10:59] -12 (-1.429% of base) : 18481.dasm - Internal.IL.InstantiatedMethodIL:GetObject(int,int):System.Object:this
[12:10:59] -12 (-1.382% of base) : 217390.dasm - System.Linq.Enumerable:MaxInteger(System.Collections.Generic.IEnumerable`1[Int32]):int
[12:10:59] -12 (-0.419% of base) : 47579.dasm - System.Xml.Schema.ParticleContentValidator:BuildTransitionTable(System.Xml.Schema.BitSet,System.Xml.Schema.BitSet[],int):System.Int32[][]:this
[12:10:59] -8 (-0.366% of base) : 191.dasm - <ReadBufferAsync>d__16:MoveNext():this
[12:10:59] -8 (-1.869% of base) : 119952.dasm - GitHub_18497:Main():int
[12:10:59] -8 (-0.187% of base) : 17523.dasm - Internal.TypeSystem.Ecma.EcmaModule:ResolveTypeReference(System.Reflection.Metadata.TypeReferenceHandle):System.Object:this
[12:10:59] -8 (-2.857% of base) : 23704.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:AddWithResize(System.__Canon):this
[12:10:59] -8 (-2.469% of base) : 65713.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:AddWithResize(System.__Canon):this
[12:10:59] -8 (-5.405% of base) : 36599.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59] -8 (-5.405% of base) : 204836.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59] -8 (-2.703% of base) : 17332.dasm - System.Collections.Generic.List`1[Boolean][System.Boolean]:AddWithResize(bool):this
[12:10:59] -8 (-2.381% of base) : 65658.dasm - System.Collections.Generic.List`1[Char][System.Char]:AddWithResize(ushort):this
[12:10:59] -8 (-2.703% of base) : 23641.dasm - System.Collections.Generic.List`1[Char][System.Char]:AddWithResize(ushort):this
[12:10:59] -8 (-5.405% of base) : 36467.dasm - System.Collections.Generic.List`1[Char][System.Char]:Grow(int):this
[12:10:59]
[12:10:59] Top method regressions (percentages):
[12:10:59] 4 (0.115% of base) : 163904.dasm - Microsoft.CodeAnalysis.CSharp.MethodCompiler:CompileNamedType(Microsoft.CodeAnalysis.CSharp.Symbols.NamedTypeSymbol):this
[12:10:59]
[12:10:59] Top method improvements (percentages):
[12:10:59] -16 (-8.889% of base) : 18437.dasm - System.Collections.Generic.Stack`1[__Canon][System.__Canon]:PushWithResize(System.__Canon):this
[12:10:59] -16 (-8.333% of base) : 17846.dasm - System.Collections.Generic.Stack`1[Int32][System.Int32]:PushWithResize(int):this
[12:10:59] -16 (-7.143% of base) : 37694.dasm - System.Collections.Generic.Stack`1[AndNode][System.Dynamic.BindingRestrictions+TestBuilder+AndNode]:PushWithResize(AndNode):this
[12:10:59] -8 (-5.405% of base) : 36599.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59] -8 (-5.405% of base) : 204836.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:Grow(int):this
[12:10:59] -8 (-5.405% of base) : 36467.dasm - System.Collections.Generic.List`1[Char][System.Char]:Grow(int):this
[12:10:59] -8 (-5.405% of base) : 204535.dasm - System.Collections.Generic.List`1[Char][System.Char]:Grow(int):this
[12:10:59] -8 (-4.762% of base) : 28673.dasm - System.Collections.Generic.List`1[Int32][System.Int32]:Grow(int):this
[12:10:59] -20 (-4.348% of base) : 217413.dasm - DevDiv_278372:Test(int,int,int,int,int,int):int
[12:10:59] -52 (-3.242% of base) : 73782.dasm - Microsoft.Build.BackEnd.Logging.EventSourceSink:Consume(Microsoft.Build.Framework.BuildEventArgs):this
[12:10:59] -4 (-2.857% of base) : 18560.dasm - Internal.TypeSystem.MetadataTypeSystemContext:ComputeHasStaticConstructor(Internal.TypeSystem.TypeDesc):bool:this
[12:10:59] -8 (-2.857% of base) : 23704.dasm - System.Collections.Generic.List`1[__Canon][System.__Canon]:AddWithResize(System.__Canon):this
[12:10:59] -8 (-2.740% of base) : 154540.dasm - System.Collections.Generic.List`1[Double][System.Double]:AddWithResize(double):this
[12:10:59] -8 (-2.703% of base) : 17332.dasm - System.Collections.Generic.List`1[Boolean][System.Boolean]:AddWithResize(bool):this
[12:10:59] -8 (-2.703% of base) : 23641.dasm - System.Collections.Generic.List`1[Char][System.Char]:AddWithResize(ushort):this
[12:10:59] -8 (-2.703% of base) : 17637.dasm - System.Collections.Generic.List`1[CorJitFlag][Internal.JitInterface.CorJitFlag]:AddWithResize(int):this
[12:10:59] -8 (-2.703% of base) : 28672.dasm - System.Collections.Generic.List`1[Int32][System.Int32]:AddWithResize(int):this
[12:10:59] -8 (-2.703% of base) : 18840.dasm - System.Collections.Generic.List`1[IntPtr][System.IntPtr]:AddWithResize(long):this
[12:10:59] -8 (-2.703% of base) : 19336.dasm - System.Collections.Generic.List`1[UInt16][System.UInt16]:AddWithResize(ushort):this
[12:10:59] -8 (-2.703% of base) : 8561.dasm - System.Collections.Generic.List`1[Vector`1][System.Numerics.Vector`1[System.Int32]]:AddWithResize(System.Numerics.Vector`1[Int32]):this
[12:10:59]
[12:10:59] 51 total methods with Code Size differences (50 improved, 1 regressed), 572 unchanged.
[12:10:59]
Looking through the remaining outstanding comments:
I fixed up the valuenum.cpp comment.
Note we may see PHIs in the conditional block (even if there's no join) as leftovers from a prior flow optimization.
As the code stands, it would reject any block with PHIs in them. I think that's the behaviour we want.
Does this PR handle int x = cond ? 1 : 0; ?
Does this PR handle
int x = cond ? 1 : 0; ?
No, because that has got the else condition of 0. Plan was to add that in a later patch.
Thinking about it, we'd want to have a specific check so that the true path and false path can both get caught by a single SELECT node.
Which is different entirely to an independent else case: if cond {x=1;} else {y=1;}
Added two Stress options: STRESS_IF_CONVERSION_INNER_LOOPS to allow inner loops and STRESS_IF_CONVERSION_COST to allow all costs.
@a74nh I see some diffs that look like the following:
-G_M30669_IG05: ; gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, isz
+G_M30669_IG05: ; gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
mov w0, wzr
tst w1, dotnet/perf-autofiling-issues#12
- beq G_M30669_IG06
- mov w0, dotnet/perf-autofiling-issues#12
- ;; size=16 bbWeight=0.50 PerfScore 1.25
-G_M30669_IG06: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+ cset x2, ne
+ mov w3, dotnet/perf-autofiling-issues#12
+ cmp w2, #0
+ csel w0, w3, w0, ne
Seems like we are missing some opportunity here to fold a relop. At the very least it could be:
mov w0, wzr
tst w1, dotnet/perf-autofiling-issues#12
- beq G_M30669_IG06
- mov w0, dotnet/perf-autofiling-issues#12
+ mov w2, dotnet/perf-autofiling-issues#12
+ csel w0, w0, w2, eq
right?
This example is from System.Net.Security.SslStream:GetSslProtocolInternal():int:this in the benchmarks collection. Of course there is some opportunity to fold the wzr too, but I guess you are leaving that up to a future PR.
I know I said I wasn't that worried about the size wise regressions, but the current diffs do look somewhat scary, so if there are some easy cases it would be good to resolve them.
Another case from IfStatements.IfStatements:SingleInner(int) (I guess this one was added by you):
-G_M9208_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- tbnz w0, #0, G_M9208_IG04
- ;; size=4 bbWeight=1 PerfScore 1.00
-G_M9208_IG03: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
- mov w0, dotnet/perf-autofiling-issues#5
- ;; size=4 bbWeight=0.50 PerfScore 0.25
-G_M9208_IG04: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+G_M9208_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+ tst w0, #1
+ cset x1, eq
+ mov w2, dotnet/perf-autofiling-issues#5
+ cmp w1, #0
+ csel w0, w2, w0, ne
This looks odd too. We should maybe skip the transformation for the cbz/cbnz/tbz/tbnz cases for now.
It would be good to run some more of these micro benchmarks with your change and check how they are affected.
Another case from IfStatements.IfStatements:SingleInner(int) (I guess this one was added by you):
Ok, that's due to the JTRUE having an AND statement. Without my patch the AND gets folded away.
N006 ( 7, 9) [000005] ----------- * JTRUE void $VN.Void
N005 ( 5, 7) [000004] J------N--- \--* NE int $101
N003 ( 3, 4) [000014] ----------- +--* AND int $100
N001 ( 1, 1) [000000] ----------- | +--* LCL_VAR int V00 arg0 u:1 $80
N002 ( 1, 2) [000013] ----------- | \--* CNS_INT int 1 $41
N004 ( 1, 2) [000003] ----------- \--* CNS_INT int 0 $40
I've added a small change to avoid any compares against 0. That should get rid of most the regressions. However, it's not going to catch the cases where the comparison is later optimized into a compare against zero
( On a side note, I've been having a quick look at adding else cases. z = (a==0) ? c : d should have no regressions. )
Seems like we are missing some opportunity here to fold a relop
Still need to check this one.
Seems like we are missing some opportunity here to fold a relop
Still need to check this one.
Ahh, this is essentially the same case as above
Ahh, this is essentially the same case as above
Just to confirm, but we should be able to handle that with csel, right?
I assume we are missing some opportunity maybe downstream or in if-conversion itself.
The regressions still look to be some of the 'bad' cases related to bit tests and unnecessary additional relops in the converted variant:
- tbz w1, #1, G_M20791_IG06
- ;; size=8 bbWeight=1 PerfScore 1.50
-G_M20791_IG05: ; gcrefRegs=3E80000 {x19 x21 x22 x23 x24 x25}, byrefRegs=0000 {}, byref
- mov w26, dotnet/perf-autofiling-issues#5
- ;; size=4 bbWeight=0.50 PerfScore 0.25
-G_M20791_IG06: ; gcrefRegs=3E80000 {x19 x21 x22 x23 x24 x25}, byrefRegs=0000 {}, byref, isz
+ tst w1, dotnet/runtime#56343
+ cset x0, ne
+ mov w2, dotnet/perf-autofiling-issues#5
+ cmp w0, #0
+ csel w26, w2, w26, ne
(System.Reflection.Emit.SignatureHelper:GetMethodSigHelper in benchmarks collection)
This diff in LUDecomp:lubksb(double[][],int,int[],double[]) from benchmarks looks odd as well:
G_M64421_IG16: ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref, isz
fcmp d16, #0.0
- beq G_M64421_IG17
- mov w4, w5
- ;; size=12 bbWeight=0.02 PerfScore 0.07
-G_M64421_IG17: ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref, isz
+ cset x6, gt
+ ble G_M64421_IG17
+ cset x6, lo
+ ;; size=16 bbWeight=0.02 PerfScore 0.08
+G_M64421_IG17: ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref
+ cmp w6, #0
+ csel w4, w5, w4, ne
+ ;; size=8 bbWeight=0.02 PerfScore 0.02
+G_M64421_IG18: ; gcrefRegs=000D {x0 x2 x3}, byrefRegs=0200 {x9}, byref, isz
May be good to extract some (float) relops into some tests to get some specific coverage of the if conversion pass.