Implement DivRem intrinsic for X86
Closes #27292.
Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.
Not implementing for Mono because I can't understand the code, and this one has special register constraints.
Note regarding the new-api-needs-documentation label:
This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
Issue Details
Closes #27292.
Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.
Not implementing for Mono because I can't understand the code, and this one has special register constraints.
| Author: | huoyaoyuan |
|---|---|
| Assignees: | - |
| Labels: |
|
| Milestone: | - |
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.
Issue Details
Closes #27292.
Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.
Not implementing for Mono because I can't understand the code, and this one has special register constraints.
| Author: | huoyaoyuan |
|---|---|
| Assignees: | - |
| Labels: |
|
| Milestone: | - |
Code gen for sample method:
public static ulong XL(ulong a, ulong b)
{
(ulong q, ulong r) = Math.DivRem(a, b);
return q + r;
}
G_M52626_IG01:
mov qword ptr [rsp+10H], rdx
;; bbWeight=1 PerfScore 1.00
G_M52626_IG02:
xor edx, edx
mov rax, rcx
div rdx:rax, qword ptr [rsp+10H]
add rax, rdx
;; bbWeight=1 PerfScore 61.75
G_M52626_IG03:
ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code: 19
In one commit it uses R8 instead of stack, but fails with register confliction in optimized code.
It can be worse if RDX is used in calling convention. For the following method:
public long DivRemInt64(long a, long b)
{
(long q, long r) = Math.DivRem(a, b);
return q + r;
}
It saves a copy to stack:
mov [rsp+10],rdx
sar rdx,3F
mov rax,[rsp+10]
idiv r8
add rax,rdx
ret
; Total bytes of code 21
And causes micro benchmark regresses. But real world sample can differ by register assignation.
Do I need to implement for Mono in this PR or create an issue to track it? This is an already implemented ISA and used in corelib. Lack of implementation would cause existing tests to fail. /cc @vargaz
Maybe put the usage inside an #ifdef CORECLR for now and disable the relevant tests for mono ?
I can't prevent LSRA from allocating op3 into EAX, and it causes random assertion failures, no matter using DelayFree or not.
If I've understood correctly, both EAX and EDX are exposed by BuildDef, so there's no need to kill them right?
@dotnet/jit-contrib I need some help here.
Edit: I misunderstood RegNum. In case the operand comes from method returned RAX, it's correctly spilled into a temp, but the RegNum of the operand keeps RAX. Realized it after viewing JIT dump in deep.
All the XARCH tests passed. ARM64 failure looks like a timeout. Marking as ready.
The failure looks unrelated, but I'm not 100% sure that I didn't introduce any hole into JIT, since the usage of Math.DivRem is not rare.
I think I've understood how to implement it for Mono. However, currently Mono uses X64 subclass to decide 64bit, which is unsuitable for nint. Leaving for the Mono team to implement it.
Nice! However, It feels weird to me to expose it as a public API - I think it's a trivial (not in terms of this implementation, but in general) thing users expect JIT to handle like other compilers do
To clarify myself - it's fine and probably even great to have GT_INTRINSIC in JIT that this PR does so then we can recognize the div/mod pattern and replace with this intrinsic - I was only concerning about the C# API.
I'm looking forward to seeing this merged
It's capable to do 128bit division, though maybe not useful.
The lack of CDQ/CQO method makes the potential result worse than recognizing the div then mod pattern. Is it reliable to recognize >>31?
Nice! However, It feels weird to me to expose it as a public API - I think it's a trivial (not in terms of this implementation, but in general) thing users expect JIT to handle like other compilers do
As from the API review, it is being exposed because x86/x64 div/idiv are special and are quite a bit more complex than just doing div + rem. They also allow 128 / 64, 64 / 32, 32 / 16, and 16 / 8 operations as well as a few other specializations that make them "different enough" that it warrants a special API rather than a general xplat helper.
There are cases where the JIT itself could be improved here as well, certainly, but some of the specializations can be quite complex and this is something that even C/C++ compilers provide as part of their intrinsics repertoires.
I can give this a general review, but its also going to require a review from someone in @dotnet/jit-contrib
@huoyaoyuan, could you resolve the merge conflict and ensure this is up to date with the latest main? It's been a while and we need to ensure CI results are current.
This still needs review from @dotnet/jit-contrib
Ping for @dotnet/jit-contrib - can we get this in .NET 7?
@kunalspathak, Libraries team needs a code review from JIT team. This is for .NET 7. PTAL. cc @dotnet/jit-contrib.
@kunalspathak, Libraries team needs a code review from JIT team. This is for .NET 7. PTAL.
I don't think we should merge this in .NET 7. There are many unknowns in the code and some, even I am not familiar with and having them right now feels risky. We can certainly revise and take it once we have a branch for .NET 8.
Failures related to https://github.com/dotnet/runtime/issues/76041
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Issue Details
Closes #27292.
Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.
Not implementing for Mono because I can't understand the code, and this one has special register constraints.
| Author: | huoyaoyuan |
|---|---|
| Assignees: | - |
| Labels: |
|
| Milestone: | 8.0.0 |
@huoyaoyuan, please resolve the merge conflict.
@huoyaoyuan, please resolve the merge conflict.
I will resolve this. As promised, I need to also investigate the CI errors that @huoyaoyuan was hitting.
I resolved merge conflicts and now failing to compile SPC.dll. On preliminary debugging looks like because of https://github.com/dotnet/runtime/pull/76766, we now expect the 2 structs involved in assignment should match, but they aren't.
I hit this error:
D:\git\dotnet-runtime\src\coreclr\jit\morphblock.cpp:815
Assertion failed 'm_dst->OperIs(GT_LCL_VAR)' in 'System.Math:DivRem(ulong,ulong):System.ValueTuple`2[ulong,ulong]' during 'Morph - Global' (IL size 33; hash 0x81ec0472; FullOpts)
while dealing with the following:
[000014] -A-XG------ * ASG struct (copy)
[000013] ---XG--N--- +--* BLK struct<System.ValueTuple`2[System.UInt64, System.UInt64], 16>
[000012] ----------- | \--* LCL_VAR byref V00 RetBuf
[000011] m------N--- \--* LCL_VAR struct<System.ValueTuple`2[System.UInt64, System.UInt64], 16>(P) V05 tmp1
\--* long V05.Item1 (offs=0x00) -> V06 tmp2
\--* long V05.Item2 (offs=0x08) -> V07 tmp3
I will work with @SingleAccretion offline to understand more about this.
Fresh assert:
Assert failure(PID 5880 [0x000016f8], Thread: 1408 [0x0580]): Assertion failed 'idx < MAX_RET_REG_COUNT' in 'System.Version:TryFormat(System.Span`1[ushort],int,byref):bool:this' during 'LSRA allocate' (IL size 245; hash 0x69e91535; Tier1)
File: D:\a\_work\1\s\src\coreclr\jit\gentree.h Line: 3747
Image: C:\h\w\B09B0A05\p\dotnet.exe
@akoeplinger - who is the best person to talk to about the failures in DivRem?
@akoeplinger - who is the best person to talk to about the failures in
DivRem?
never mind. It was related to https://github.com/dotnet/runtime/issues/75767. I added annotations to skip for mono.
Failure is https://github.com/dotnet/runtime/issues/80666
Unconditionally using the intrinsic from Math.DivRem API is going to cause a lot of regressions
I agree. I checked the current implementation vs. DivRem and with DivRem, it definitely looks worse. MyTest1 is the current implementation while MyTest2 is with DivRem. Same with MyTest_const1 and MyTest_const2.
long_example.txt int_example.txt sbyte_example.txt int_example_return_tuple.txt
Unconditionally using the intrinsic from Math.DivRem API is going to cause a lot of regressions
I agree. I checked the current implementation vs.
DivRemand withDivRem, it definitely looks worse.MyTest1is the current implementation whileMyTest2is withDivRem. Same withMyTest_const1andMyTest_const2.
From these samples it looks like the new codegen is actually worse for the non-constant case also? It's repeatedly copying stuff to/from memory instead of using registers?