Vector{128,256} operations that use MmShuffle fall back to method call
Prerequisites
- [X] I have written a descriptive issue title
- [X] I have verified that I am running the latest version of ImageSharp
- [X] I have verified if the problem exist in both
DEBUGandRELEASEmode - [X] I have searched open and closed issues to ensure it has not already been reported
ImageSharp version
Current main branch
Other ImageSharp packages and versions
none
Environment (Operating system, version and so on)
all .NET supported
.NET Framework version
all
Description
While working on https://github.com/SixLabors/ImageSharp/issues/1762 I recognized that methods that use https://github.com/SixLabors/ImageSharp/blob/c661ab1478419004271a1bd2281f5d4e478840b9/src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs#L236-L238 won't emit platform-intrinsics, rather fallback to a method call as the value isn't a constant.
E.g. Vp8Encoding.FTransformPass1SSE2 looks after inlining the vector constants like
push rdi
push rsi
push rbx
sub rsp,60
vzeroupper
xor eax,eax
mov [rsp+50],rax
mov [rsp+58],rax
mov rsi,rdx
mov rdi,r8
mov rbx,r9
vmovupd xmm0,[rcx]
lea rcx,[rsp+40]
vmovapd [rsp+30],xmm0
lea rdx,[rsp+30]
mov r8d,0B1
call System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
vmovupd xmm0,[rsi]
lea rcx,[rsp+50]
vmovapd [rsp+30],xmm0
lea rdx,[rsp+30]
mov r8d,0B1
call System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
vmovapd xmm0,[rsp+40]
vpunpcklqdq xmm0,xmm0,[rsp+50]
vmovapd xmm1,[rsp+40]
vpunpckhqdq xmm1,xmm1,[rsp+50]
vpaddw xmm2,xmm0,xmm1
vpmaddwd xmm3,xmm2,[7FF7D3640A20]
vpmaddwd xmm2,xmm2,[7FF7D3640A30]
vpsubw xmm0,xmm0,xmm1
vpmaddwd xmm1,xmm0,[7FF7D3640A40]
vpaddd xmm1,xmm1,[7FF7D3640A50]
vpsrad xmm1,xmm1,9
vpmaddwd xmm0,xmm0,[7FF7D3640A60]
vpaddd xmm0,xmm0,[7FF7D3640A70]
vpsrad xmm0,xmm0,9
vpackssdw xmm0,xmm1,xmm0
vpackssdw xmm1,xmm3,xmm2
vpunpcklwd xmm2,xmm1,xmm0
vpunpckhwd xmm0,xmm1,xmm0
vpunpckhdq xmm1,xmm2,xmm0
vpunpckldq xmm0,xmm2,xmm0
vmovupd [rdi],xmm0
mov rcx,rbx
vmovapd [rsp+20],xmm1
lea rdx,[rsp+20]
mov r8d,4E
call System.Runtime.Intrinsics.X86.Sse2.Shuffle(System.Runtime.Intrinsics.Vector128`1<Int32>, Byte)
nop
add rsp,60
pop rbx
pop rsi
pop rdi
ret
; Total bytes of code 245
If instead SimdUtils.Shuffle.MmShuffle the constant is given as literal, then the code boils down to:
vzeroupper
vpshufhw xmm0,[rdx],0B1
vpshufhw xmm1,[rcx],0B1
vpunpcklqdq xmm2,xmm1,xmm0
vpunpckhqdq xmm0,xmm1,xmm0
vpaddw xmm1,xmm2,xmm0
vpmaddwd xmm3,xmm1,[7FF7D365DAC0]
vpmaddwd xmm1,xmm1,[7FF7D365DAD0]
vpsubw xmm0,xmm2,xmm0
vpmaddwd xmm2,xmm0,[7FF7D365DAE0]
vpaddd xmm2,xmm2,[7FF7D365DAF0]
vpsrad xmm2,xmm2,9
vpmaddwd xmm0,xmm0,[7FF7D365DB00]
vpaddd xmm0,xmm0,[7FF7D365DB10]
vpsrad xmm0,xmm0,9
vpackssdw xmm0,xmm2,xmm0
vpackssdw xmm1,xmm3,xmm1
vpunpcklwd xmm2,xmm1,xmm0
vpunpckhwd xmm0,xmm1,xmm0
vpunpckhdq xmm1,xmm2,xmm0
vpunpckldq xmm0,xmm2,xmm0
vmovupd [r8],xmm0
vpshufd xmm0,xmm1,4E
vmovupd [r9],xmm0
ret
; Total bytes of code 127
This is a de-facto a JIT-limitation, recorded in https://github.com/dotnet/runtime/issues/9989 and https://github.com/dotnet/runtime/issues/38003, also noticed in https://github.com/SixLabors/ImageSharp/pull/1517#discussion_r562059328
As this is quite a difference in code-gen, hence in perf it should show up too, I propose to change to typing the shuffle literals explicetely. E.g.
-Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, SimdUtils.Shuffle.MmShuffle(2, 3, 0, 1)); // or any similar pattern
+Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, 0xB1); // MmShuffle(2, 3, 0, 1)
If OK I'd like to tackle this in one shot with https://github.com/SixLabors/ImageSharp/issues/1762 (as I'm touching these pieces anyway). Edit: I was too eager, the PR is out...
Steps to Reproduce
Look at dissassembly of any method that uses SimdUtils.Shuffle.MmShuffle.
Images
No response
@gfoidl This has been on my mind for a long while and I was actually thinking about it the other day.
Ideally I'd like to be able to keep using MMShuffle in some manner as it helps visual the operation and helps porting. I was thinking maybe we could generate a bunch of constants for that using either T4 or source generators. Maybe even a full collection like the following.
https://github.com/john-h-k/MathSharp/blob/706c02a979c9b7e5b55bf5503c999059a0cb2330/sources/MathSharp/Utils/ShuffleValues.cs#L20
Another option is to use binary literals. 0b10_11_00_01 is a bit more readable and 0xB1 but less than MmShuffle(2, 3, 0, 1)
Defining constants like public const int MmShuffle_2301 = 0b10_11_00_01 would also be viable and allow the optimization.
-- Getting the JIT to handle this properly is still on my radar, but its not "simple" and probably won't make .NET 7 unfortunately.
The approach with the constant looks good to me. It's self describing and the optimization kicks in.
How could things like https://github.com/SixLabors/ImageSharp/blob/c661ab1478419004271a1bd2281f5d4e478840b9/src/ImageSharp/Common/Helpers/Shuffle/IComponentShuffle.cs#L113-L119 be handled in order for the JIT to don't emit the software fallback?
Typing a literal or the constant approach (as outlined above) won't work here as it's no compile time constant. Or falls this under the category "trade-off that has to be taken"?
be handled in order for the JIT to don't emit the software fallback?
This pattern could be changed to effectively expose Shuffle(op1), Shuffle(op1, op2) methods instead and those could themselves do Isa.Shuffle(op1, MmShuffle_0123) which would get optimized as expected.
The current issue in the JIT is that we are making a decision during "import" on whether or not a given call should be an intrinsic or left as a call. For most APIs this is "always intrinsic", for the handful of APIs that require a constant operand (because the underlying instruction encoding requires it), we currently just say "if the operand is constant, intrinsic; otherwise leave as call".
This misses some cases, particularly where an input becomes constant after inlining. However, enabling handling for that case requires that we either generate some "small fallback" (not feasible in many cases), emit the 256-case jump table "inline" (which would overall be worse for throughput), or update the JIT to support converting the intrinsic back into a call very late (around rationalization or lowering). The last option is the "best", but its also incredibly complicated to handle and its not bubbled up to the point where I can do that work yet.
@tannergooding thanks for the insights -- much appreciated!