runtime Implement DivRem intrinsic for X86

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

Mar 13 '22 09:03 huoyaoyuan

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Mar 13 '22 09:03 dotnet-issue-labeler[bot]

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

Author:	huoyaoyuan
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `new-api-needs-documentation`
Milestone:	-

Mar 13 '22 09:03 ghost

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

Author:	huoyaoyuan
Assignees:	-
Labels:	`area-System.Runtime.Intrinsics`, `new-api-needs-documentation`
Milestone:	-

Mar 13 '22 09:03 ghost

Code gen for sample method:

        public static ulong XL(ulong a, ulong b)
        {
            (ulong q, ulong r) = Math.DivRem(a, b);
            return q + r;
        }

G_M52626_IG01:
       mov      qword ptr [rsp+10H], rdx
						;; bbWeight=1    PerfScore 1.00

G_M52626_IG02:
       xor      edx, edx
       mov      rax, rcx
       div      rdx:rax, qword ptr [rsp+10H]
       add      rax, rdx
						;; bbWeight=1    PerfScore 61.75

G_M52626_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 19

In one commit it uses R8 instead of stack, but fails with register confliction in optimized code.

It can be worse if RDX is used in calling convention. For the following method:

        public long DivRemInt64(long a, long b)
        {
            (long q, long r) = Math.DivRem(a, b);
            return q + r;
        }

It saves a copy to stack:

       mov       [rsp+10],rdx
       sar       rdx,3F
       mov       rax,[rsp+10]
       idiv      r8
       add       rax,rdx
       ret
; Total bytes of code 21

And causes micro benchmark regresses. But real world sample can differ by register assignation.

Mar 13 '22 10:03 huoyaoyuan

Do I need to implement for Mono in this PR or create an issue to track it? This is an already implemented ISA and used in corelib. Lack of implementation would cause existing tests to fail. /cc @vargaz

Mar 13 '22 10:03 huoyaoyuan

Maybe put the usage inside an #ifdef CORECLR for now and disable the relevant tests for mono ?

Mar 13 '22 15:03 vargaz

I can't prevent LSRA from allocating op3 into EAX, and it causes random assertion failures, no matter using DelayFree or not. If I've understood correctly, both EAX and EDX are exposed by BuildDef, so there's no need to kill them right?

@dotnet/jit-contrib I need some help here.

Edit: I misunderstood RegNum. In case the operand comes from method returned RAX, it's correctly spilled into a temp, but the RegNum of the operand keeps RAX. Realized it after viewing JIT dump in deep.

Mar 14 '22 16:03 huoyaoyuan

All the XARCH tests passed. ARM64 failure looks like a timeout. Marking as ready.

Mar 15 '22 05:03 huoyaoyuan

The failure looks unrelated, but I'm not 100% sure that I didn't introduce any hole into JIT, since the usage of Math.DivRem is not rare.

Mar 15 '22 11:03 huoyaoyuan

I think I've understood how to implement it for Mono. However, currently Mono uses X64 subclass to decide 64bit, which is unsuitable for nint. Leaving for the Mono team to implement it.

Mar 16 '22 15:03 huoyaoyuan

Nice! However, It feels weird to me to expose it as a public API - I think it's a trivial (not in terms of this implementation, but in general) thing users expect JIT to handle like other compilers do

Mar 16 '22 15:03 EgorBo

To clarify myself - it's fine and probably even great to have GT_INTRINSIC in JIT that this PR does so then we can recognize the div/mod pattern and replace with this intrinsic - I was only concerning about the C# API.

I'm looking forward to seeing this merged

Mar 16 '22 15:03 EgorBo

It's capable to do 128bit division, though maybe not useful. The lack of CDQ/CQO method makes the potential result worse than recognizing the div then mod pattern. Is it reliable to recognize >>31?

Mar 16 '22 15:03 huoyaoyuan

Nice! However, It feels weird to me to expose it as a public API - I think it's a trivial (not in terms of this implementation, but in general) thing users expect JIT to handle like other compilers do

As from the API review, it is being exposed because x86/x64 div/idiv are special and are quite a bit more complex than just doing div + rem. They also allow 128 / 64, 64 / 32, 32 / 16, and 16 / 8 operations as well as a few other specializations that make them "different enough" that it warrants a special API rather than a general xplat helper.

There are cases where the JIT itself could be improved here as well, certainly, but some of the specializations can be quite complex and this is something that even C/C++ compilers provide as part of their intrinsics repertoires.

Mar 16 '22 15:03 tannergooding

I can give this a general review, but its also going to require a review from someone in @dotnet/jit-contrib

Mar 23 '22 21:03 tannergooding

@huoyaoyuan, could you resolve the merge conflict and ensure this is up to date with the latest main? It's been a while and we need to ensure CI results are current.

This still needs review from @dotnet/jit-contrib

May 19 '22 15:05 tannergooding

Ping for @dotnet/jit-contrib - can we get this in .NET 7?

Jul 14 '22 02:07 huoyaoyuan

@kunalspathak, Libraries team needs a code review from JIT team. This is for .NET 7. PTAL. cc @dotnet/jit-contrib.

Jul 19 '22 19:07 JulieLeeMSFT

@kunalspathak, Libraries team needs a code review from JIT team. This is for .NET 7. PTAL.

I don't think we should merge this in .NET 7. There are many unknowns in the code and some, even I am not familiar with and having them right now feels risky. We can certainly revise and take it once we have a branch for .NET 8.

Jul 20 '22 05:07 kunalspathak

Failures related to https://github.com/dotnet/runtime/issues/76041

Sep 23 '22 16:09 kunalspathak

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #27292.

Some part of code was took from #37928 and #64864. I haven't fully understood all the concepts indeed. Needs some JIT expert to explain them.

Not implementing for Mono because I can't understand the code, and this one has special register constraints.

Author:	huoyaoyuan
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `new-api-needs-documentation`
Milestone:	8.0.0

Oct 31 '22 16:10 ghost

@huoyaoyuan, please resolve the merge conflict.

Nov 07 '22 09:11 JulieLeeMSFT

@huoyaoyuan, please resolve the merge conflict.

I will resolve this. As promised, I need to also investigate the CI errors that @huoyaoyuan was hitting.

Nov 07 '22 14:11 kunalspathak

I resolved merge conflicts and now failing to compile SPC.dll. On preliminary debugging looks like because of https://github.com/dotnet/runtime/pull/76766, we now expect the 2 structs involved in assignment should match, but they aren't.

I hit this error:

D:\git\dotnet-runtime\src\coreclr\jit\morphblock.cpp:815
Assertion failed 'm_dst->OperIs(GT_LCL_VAR)' in 'System.Math:DivRem(ulong,ulong):System.ValueTuple`2[ulong,ulong]' during 'Morph - Global' (IL size 33; hash 0x81ec0472; FullOpts)

while dealing with the following:

               [000014] -A-XG------                         *  ASG       struct (copy)
               [000013] ---XG--N---                         +--*  BLK       struct<System.ValueTuple`2[System.UInt64, System.UInt64], 16>
               [000012] -----------                         |  \--*  LCL_VAR   byref  V00 RetBuf       
               [000011] m------N---                         \--*  LCL_VAR   struct<System.ValueTuple`2[System.UInt64, System.UInt64], 16>(P) V05 tmp1         
                                                            \--*    long   V05.Item1 (offs=0x00) -> V06 tmp2         
                                                            \--*    long   V05.Item2 (offs=0x08) -> V07 tmp3

I will work with @SingleAccretion offline to understand more about this.

Nov 07 '22 15:11 kunalspathak

Fresh assert:

Assert failure(PID 5880 [0x000016f8], Thread: 1408 [0x0580]): Assertion failed 'idx < MAX_RET_REG_COUNT' in 'System.Version:TryFormat(System.Span`1[ushort],int,byref):bool:this' during 'LSRA allocate' (IL size 245; hash 0x69e91535; Tier1)

    File: D:\a\_work\1\s\src\coreclr\jit\gentree.h Line: 3747
    Image: C:\h\w\B09B0A05\p\dotnet.exe

Dec 07 '22 14:12 kunalspathak

@akoeplinger - who is the best person to talk to about the failures in DivRem?

Dec 09 '22 22:12 kunalspathak

@akoeplinger - who is the best person to talk to about the failures in DivRem?

never mind. It was related to https://github.com/dotnet/runtime/issues/75767. I added annotations to skip for mono.

Dec 09 '22 22:12 kunalspathak

Failure is https://github.com/dotnet/runtime/issues/80666

Jan 18 '23 17:01 kunalspathak

Unconditionally using the intrinsic from Math.DivRem API is going to cause a lot of regressions

I agree. I checked the current implementation vs. DivRem and with DivRem, it definitely looks worse. MyTest1 is the current implementation while MyTest2 is with DivRem. Same with MyTest_const1 and MyTest_const2.

long_example.txt int_example.txt sbyte_example.txt int_example_return_tuple.txt

Jan 20 '23 07:01 kunalspathak

Unconditionally using the intrinsic from Math.DivRem API is going to cause a lot of regressions

I agree. I checked the current implementation vs. DivRem and with DivRem, it definitely looks worse. MyTest1 is the current implementation while MyTest2 is with DivRem. Same with MyTest_const1 and MyTest_const2.

From these samples it looks like the new codegen is actually worse for the non-constant case also? It's repeatedly copying stuff to/from memory instead of using registers?

Jan 20 '23 15:01 pentp