ImageSharp Vector{128,256} operations that use MmShuffle fall back to method call

Prerequisites

[X] I have written a descriptive issue title
[X] I have verified that I am running the latest version of ImageSharp
[X] I have verified if the problem exist in both DEBUG and RELEASE mode
[X] I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

Current main branch

Other ImageSharp packages and versions

none

Environment (Operating system, version and so on)

all .NET supported

.NET Framework version

all

Description

While working on https://github.com/SixLabors/ImageSharp/issues/1762 I recognized that methods that use https://github.com/SixLabors/ImageSharp/blob/c661ab1478419004271a1bd2281f5d4e478840b9/src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs#L236-L238 won't emit platform-intrinsics, rather fallback to a method call as the value isn't a constant.

E.g. Vp8Encoding.FTransformPass1SSE2 looks after inlining the vector constants like

       push      rdi
       push      rsi
       push      rbx
       sub       rsp,60
       vzeroupper
       xor       eax,eax
       mov       [rsp+50],rax
       mov       [rsp+58],rax
       mov       rsi,rdx
       mov       rdi,r8
       mov       rbx,r9
       vmovupd   xmm0,[rcx]
       lea       rcx,[rsp+40]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovupd   xmm0,[rsi]
       lea       rcx,[rsp+50]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovapd   xmm0,[rsp+40]
       vpunpcklqdq xmm0,xmm0,[rsp+50]
       vmovapd   xmm1,[rsp+40]
       vpunpckhqdq xmm1,xmm1,[rsp+50]
       vpaddw    xmm2,xmm0,xmm1
       vpmaddwd  xmm3,xmm2,[7FF7D3640A20]
       vpmaddwd  xmm2,xmm2,[7FF7D3640A30]
       vpsubw    xmm0,xmm0,xmm1
       vpmaddwd  xmm1,xmm0,[7FF7D3640A40]
       vpaddd    xmm1,xmm1,[7FF7D3640A50]
       vpsrad    xmm1,xmm1,9
       vpmaddwd  xmm0,xmm0,[7FF7D3640A60]
       vpaddd    xmm0,xmm0,[7FF7D3640A70]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm1,xmm0
       vpackssdw xmm1,xmm3,xmm2
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [rdi],xmm0
       mov       rcx,rbx
       vmovapd   [rsp+20],xmm1
       lea       rdx,[rsp+20]
       mov       r8d,4E
       call      System.Runtime.Intrinsics.X86.Sse2.Shuffle(System.Runtime.Intrinsics.Vector128`1<Int32>, Byte)
       nop
       add       rsp,60
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 245

If instead SimdUtils.Shuffle.MmShuffle the constant is given as literal, then the code boils down to:

       vzeroupper
       vpshufhw  xmm0,[rdx],0B1
       vpshufhw  xmm1,[rcx],0B1
       vpunpcklqdq xmm2,xmm1,xmm0
       vpunpckhqdq xmm0,xmm1,xmm0
       vpaddw    xmm1,xmm2,xmm0
       vpmaddwd  xmm3,xmm1,[7FF7D365DAC0]
       vpmaddwd  xmm1,xmm1,[7FF7D365DAD0]
       vpsubw    xmm0,xmm2,xmm0
       vpmaddwd  xmm2,xmm0,[7FF7D365DAE0]
       vpaddd    xmm2,xmm2,[7FF7D365DAF0]
       vpsrad    xmm2,xmm2,9
       vpmaddwd  xmm0,xmm0,[7FF7D365DB00]
       vpaddd    xmm0,xmm0,[7FF7D365DB10]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm2,xmm0
       vpackssdw xmm1,xmm3,xmm1
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [r8],xmm0
       vpshufd   xmm0,xmm1,4E
       vmovupd   [r9],xmm0
       ret
; Total bytes of code 127

This is a de-facto a JIT-limitation, recorded in https://github.com/dotnet/runtime/issues/9989 and https://github.com/dotnet/runtime/issues/38003, also noticed in https://github.com/SixLabors/ImageSharp/pull/1517#discussion_r562059328

As this is quite a difference in code-gen, hence in perf it should show up too, I propose to change to typing the shuffle literals explicetely. E.g.

-Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, SimdUtils.Shuffle.MmShuffle(2, 3, 0, 1));   // or any similar pattern
+Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, 0xB1);  // MmShuffle(2, 3, 0, 1)

If OK I'd like to tackle this in one shot with https://github.com/SixLabors/ImageSharp/issues/1762 (as I'm touching these pieces anyway). Edit: I was too eager, the PR is out...

Steps to Reproduce

Look at dissassembly of any method that uses SimdUtils.Shuffle.MmShuffle.

Images

No response

May 18 '22 12:05 gfoidl

@gfoidl This has been on my mind for a long while and I was actually thinking about it the other day.

Ideally I'd like to be able to keep using MMShuffle in some manner as it helps visual the operation and helps porting. I was thinking maybe we could generate a bunch of constants for that using either T4 or source generators. Maybe even a full collection like the following.

https://github.com/john-h-k/MathSharp/blob/706c02a979c9b7e5b55bf5503c999059a0cb2330/sources/MathSharp/Utils/ShuffleValues.cs#L20

May 18 '22 13:05 JimBobSquarePants

Another option is to use binary literals. 0b10_11_00_01 is a bit more readable and 0xB1 but less than MmShuffle(2, 3, 0, 1)

Defining constants like public const int MmShuffle_2301 = 0b10_11_00_01 would also be viable and allow the optimization.

-- Getting the JIT to handle this properly is still on my radar, but its not "simple" and probably won't make .NET 7 unfortunately.

May 18 '22 14:05 tannergooding

The approach with the constant looks good to me. It's self describing and the optimization kicks in.

May 18 '22 14:05 gfoidl

How could things like https://github.com/SixLabors/ImageSharp/blob/c661ab1478419004271a1bd2281f5d4e478840b9/src/ImageSharp/Common/Helpers/Shuffle/IComponentShuffle.cs#L113-L119 be handled in order for the JIT to don't emit the software fallback?

Typing a literal or the constant approach (as outlined above) won't work here as it's no compile time constant. Or falls this under the category "trade-off that has to be taken"?

May 18 '22 14:05 gfoidl

be handled in order for the JIT to don't emit the software fallback?

This pattern could be changed to effectively expose Shuffle(op1), Shuffle(op1, op2) methods instead and those could themselves do Isa.Shuffle(op1, MmShuffle_0123) which would get optimized as expected.

The current issue in the JIT is that we are making a decision during "import" on whether or not a given call should be an intrinsic or left as a call. For most APIs this is "always intrinsic", for the handful of APIs that require a constant operand (because the underlying instruction encoding requires it), we currently just say "if the operand is constant, intrinsic; otherwise leave as call".

This misses some cases, particularly where an input becomes constant after inlining. However, enabling handling for that case requires that we either generate some "small fallback" (not feasible in many cases), emit the 256-case jump table "inline" (which would overall be worse for throughput), or update the JIT to support converting the intrinsic back into a call very late (around rationalization or lowering). The last option is the "best", but its also incredibly complicated to handle and its not bubbled up to the point where I can do that work yet.

May 18 '22 14:05 tannergooding

@tannergooding thanks for the insights -- much appreciated!

May 18 '22 15:05 gfoidl