jdk 8328998: Encoding support for Intel APX extended general-purpose registers

Add instruction encoding support for Intel APX extended general-purpose registers:

Intel Advanced Performance Extensions (APX) doubles the number of general-purpose registers, from 16 to 32. For more information about APX, see https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html.

By specification, instruction encoding remains unchanged for instructions using only the lower 16 GPRs. For cases where one or more instruction operands reference extended GPRs (Egprs), encoding targets either REX2, an extension of REX encoding, or an extended version of EVEX encoding. These new encoding schemes extend or modify existing instruction prefixes only when Egprs are used.

Progress

[x] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8328998: Encoding support for Intel APX extended general-purpose registers (Sub-task - P4)

Reviewers

Sandhya Viswanathan (@sviswa7 - Reviewer) ⚠️ Review applies to 7b3e8ec7
Jatin Bhateja (@jatin-bhateja - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18476/head:pull/18476
$ git checkout pull/18476

Update a local copy of the PR:
$ git checkout pull/18476
$ git pull https://git.openjdk.org/jdk.git pull/18476/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18476

View PR using the GUI difftool:
$ git pr show -t 18476

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18476.diff

Webrev

Link to Webrev Comment

Mar 25 '24 19:03 steveatgh

:wave: Welcome back steveatgh! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Mar 25 '24 19:03 bridgekeeper[bot]

@steveatgh This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8328998: Encoding support for Intel APX extended general-purpose registers

Reviewed-by: kvn, sviswanathan, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 247 new commits pushed to the master branch:

ddd73b458355bffeaa8e0e5017c27d6c6af2dc94: 8332082: Shenandoah: Use consistent tests to determine when pre-write barrier is active
0a9d1f8c89e946d99f01549515f6044e53992168: 8332749: Broken link in MemorySegment.Scope.html
c9a7b9772d96d9a4825d9da2aacc277534282860: 8332829: [BACKOUT] C2: crash in compiled code because of dependency on removed range check CastIIs
7fd9d6c760c66d3e2f4034cf1a6b1b583ff829a9: 8332340: Add JavacBench as a test case for CDS
417d174aa1b7bd3b5755e5f2352d9bbe6ce6f183: 8331348: Some incremental builds deposit files in the make directory
303ac9f270f567d821d156f3a9d4f4c070f43f95: 8332671: Logging for pretouching thread stacks shows wrong memory range
90758f6735620776fcb60da9e0e2c91a4f53aaf1: 8332808: Always set java.io.tmpdir to a suitable value in the build
e19a421c30534566ba0dea0fa84f812ebeecfc87: 8332720: ubsan: instanceKlass.cpp:3550:76: runtime error: member call on null pointer of type 'struct Array'
2581935b47afaf661a94c8a8e50ce08065d632f6: 8332528: Generate code in SwitchBootstraps.generateTypeSwitch that require fewer adaptations
b890336e111ea8473ae49e9992bc2fd61e716792: 8328083: degrade virtual thread support for GetObjectMonitorUsage
... and 237 more: https://git.openjdk.org/jdk/compare/6f98d8f58f98827ae454c7ce4839de4071d95767...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @sviswa7, @jatin-bhateja, @eme64) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Mar 25 '24 19:03 openjdk[bot]

@steveatgh The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Mar 25 '24 19:03 openjdk[bot]

Hi @steveatgh , Please use newly created JBS entry JDK-8328998 for this draft PR.

Mar 26 '24 07:03 jatin-bhateja

Webrevs

Apr 02 '24 01:04 mlbridge[bot]

How can we be confident that the encoding is correct? Would it be possible to write tests for this? Maybe one that disassembles it and compares the result to a 3rd party disassembler offline or in-process hsdis?

Apr 09 '24 19:04 dean-long

Also: what if the UseAPX is enabled, but the hardware does not support the feature? Don't we usually then automatically disable the flag, if the feature is not present? We do that with UseAVX for example.

Apr 15 '24 09:04 eme64

And I agree with @dean-long : you need to have some test for this. At least some test should have the flag enabled. Something that stresses the registers, and verifies the results could be an idea.

Also: I suspect you would want to put this flag into the IR-framework whitelist, since it has no effect on the IR, right?

Apr 15 '24 09:04 eme64

Can the APX features be simulated, maybe even with SDE?

Now you made the flag EXPERIMENTAL and by default false. What is the roadmap with this? It is generally not great to have default false flags, because the code underneath will just slowly rot and become broken. Is there a plan to eventually make it default true? What stops us from doing that already now?

Apr 15 '24 09:04 eme64

Thank you @dean-long for the comment. I agree, tests are needed. Up to this point we have not had a separate formal tool to test encoding of x86. I did a lot of manual testing by adding loops that used r0-r31in different addressing patterns. I put these in a stub file that would be compiled by hotspot but not executed. I manually compared the disassembly of that against the output of similar assembly included in a small C program and run on the SDE. This worked pretty well for debugging but the manual aspect of it makes it error-prone and it takes a lot of time, too much time if iterating an implementation.

Subsequent pull requests will add encoding support for additional APX instructions (e.g. those using New Data Destination). Maybe one of these PRs can include a tool for testing instruction encoding for APX features. What do you think?

Apr 16 '24 17:04 steveatgh

Thank you @eme64 for the comments. The functionality of the UseAPX flag is, as you point out, incomplete in this pull request. A subsequent PR (see JDK-8329030) will tie the logic of the flag in with a query of the hardware features. It was added in this PR thinking it could be useful for testing or debugging the encoding functionality.

Apr 16 '24 18:04 steveatgh

Should is_src_gpr be set to true for the following: void Assembler::movdl(Register dst, XMMRegister src) { NOT_LP64(assert(VM_Version::supports_sse2(), "")); InstructionAttr attributes(AVX_128bit, /* rex_w / false, / legacy_mode / false, / no_mask_reg / true, / uses_vl */ false); // swap src/dst to get correct prefix int encode = simd_prefix_and_encode(src, xnoreg, as_XMMRegister(dst->encoding()), VEX_SIMD_66, VEX_OPCODE_0F, &attributes); emit_int16(0x7E, (0xC0 | encode)); }

Apr 26 '24 20:04 sviswa7

Should is_src_gpr be set to true for the additional following instructions as well: void Assembler::pextrd(Register dst, XMMRegister src, int imm8) void Assembler::pextrq(Register dst, XMMRegister src, int imm8) void Assembler::pextrb(Register dst, XMMRegister src, int imm8)
void Assembler::extractps(Register dst, XMMRegister src, uint8_t imm8)
void Assembler::pextl(Register dst, Register src1, Address src2)
void Assembler::pdepl(Register dst, Register src1, Address src2)
void Assembler::pextq(Register dst, Register src1, Address src2)
void Assembler::pdepq(Register dst, Register src1, Address src2)
void Assembler::movdq(Register dst, XMMRegister src)

Apr 26 '24 23:04 sviswa7

Also the following instruction is not handled for egprs: void Assembler::popq(Register dst)

Apr 26 '24 23:04 sviswa7

It looks to me that the source and dest are reversed in the following instruction in call to simd_prefix_and_encode, perhaps that should be a separate PR: // Do we have this wrong src and dst reversed in simd_prefix_and_encode? void Assembler::pextrw(Register dst, XMMRegister src, int imm8) { assert(VM_Version::supports_sse2(), ""); InstructionAttr attributes(AVX_128bit, /* rex_w / false, / legacy_mode / _legacy_mode_bw, / no_mask_reg / true, / uses_vl */ false); int encode = simd_prefix_and_encode(as_XMMRegister(dst->encoding()), xnoreg, src, VEX_SIMD_66, VEX_OPCODE_0F, &attributes); emit_int24((unsigned char)0xC5, (0xC0 | encode), imm8); } Once that PR is fixed, is_src_gpr should be set to true for this one as well.

Apr 26 '24 23:04 sviswa7

Should is_src_gpr be set to true for the additional following instructions as well:

Thanks @sviswa7, yes some of these should also have src_is_gpr = true. Fixed: void Assembler::extractps(Register dst, XMMRegister src, uint8_t imm8) void Assembler::pextrd(Register dst, XMMRegister src, int imm8) void Assembler::pextrq(Register dst, XMMRegister src, int imm8) void Assembler::pextrb(Register dst, XMMRegister src, int imm8) void Assembler::movdq(Register dst, XMMRegister src)

These use a different prefix function, one that doesn't use a src_is_gpr parameter: void Assembler::pextl(Register dst, Register src1, Address src2) void Assembler::pdepl(Register dst, Register src1, Address src2) void Assembler::pextq(Register dst, Register src1, Address src2) void Assembler::pdepq(Register dst, Register src1, Address src2)

Apr 27 '24 00:04 steveatgh

@sviswa7 wrote

Also the following instruction is not handled for egprs: void Assembler::popq(Register dst)

Thank you. Updated popq(Register) for egpr support.

Apr 29 '24 21:04 steveatgh

SHA instructions (sha1rnds4, sha1nexte, sha1msg1, sha1msg2, sha256rnds2, sha256msg1, sha256msg2) needs to be encoded using EVEX encoding when egprs are in use.

Apr 30 '24 20:04 sviswa7

It looks to me that the source and dest are reversed in the following instruction in call to simd_prefix_and_encode, perhaps that should be a separate PR: // Do we have this wrong src and dst reversed in simd_prefix_and_encode? void Assembler::pextrw(Register dst, XMMRegister src, int imm8) { assert(VM_Version::supports_sse2(), ""); InstructionAttr attributes(AVX_128bit, /* rex_w / false, / legacy_mode _/ legacy_mode_bw, / no_mask_reg / true, / uses_vl */ false); int encode = simd_prefix_and_encode(as_XMMRegister(dst->encoding()), xnoreg, src, VEX_SIMD_66, VEX_OPCODE_0F, &attributes); emit_int24((unsigned char)0xC5, (0xC0 | encode), imm8); } Once that PR is fixed, is_src_gpr should be set to true for this one as well.

Verified that the pextrw has the operands reversed per the SDM, so please ignore this comment.

May 01 '24 00:05 sviswa7

SHA instructions (sha1rnds4, sha1nexte, sha1msg1, sha1msg2, sha256rnds2, sha256msg1, sha256msg2) needs to be encoded using EVEX encoding when egprs are in use.

Thank you, I missed these. The APX 3.0 spec says xmm register use is limited to 0-15 for SHA instructions. Coincidentally, the new version 4.0 APX spec. also removes support for EVEX promotion of SHA instructions. Given these specs, I don't think any encoding changes are needed. I've added an assert to these 7 instructions to check that only registers < 16 are used.

May 01 '24 17:05 steveatgh

@sviswa7 Thank you for your review comments. Very helpful!

May 02 '24 00:05 steveatgh

@steveatgh Please also do a merge with master.

May 02 '24 17:05 sviswa7

:warning: @steveatgh the full name on your profile does not match the author name in this pull requests' HEAD commit. If this pull request gets integrated then the author name from this pull requests' HEAD commit will be used for the resulting commit. If you wish to push a new commit with a different author name, then please run the following commands in a local repository of your personal fork:

$ git checkout apx-encoding-pr
$ git commit --author='Preferred Full Name <[email protected]>' --allow-empty -m 'Update full name'
$ git push

May 02 '24 17:05 openjdk[bot]

@vnkozlov Are there other things you would like to see for this pull request?

May 02 '24 23:05 steveatgh

Thank you @dean-long for the comment. I agree, tests are needed. Up to this point we have not had a separate formal tool to test encoding of x86. I did a lot of manual testing by adding loops that used r0-r31in different addressing patterns. I put these in a stub file that would be compiled by hotspot but not executed. I manually compared the disassembly of that against the output of similar assembly included in a small C program and run on the SDE. This worked pretty well for debugging but the manual aspect of it makes it error-prone and it takes a lot of time, too much time if iterating an implementation.

Subsequent pull requests will add encoding support for additional APX instructions (e.g. those using New Data Destination). Maybe one of these PRs can include a tool for testing instruction encoding for APX features. What do you think?

When we wrote the AArch64 port, there was no available hardware to test it on. So, we wrote a simulator to test it. However, we ran the risk that if our understanding of instruction encoding was wrong, our assembler and our simulator might appear to work correctly when used together, but the result would not run on real AArch64 hardware once it arrived. So, as well as a simulator for the architecture, we verified the internal HotSpot assembler by checking its encoding against GNU as. See /test/hotspot/gtest/aarch64, where a Python program generates source for both the HotSpot internal assembler and GNU as. I strongly suggest you do something similar. (As a matter for the historical record, this did work. The test found several encoding bugs. Once we got the first real AArch64 hardware, the port worked almost immediately.)

May 03 '24 08:05 theRealAph

In response to @dean-long, @theRealAph wrote:

When we wrote the AArch64 port, there was no available hardware to test it on. So, we wrote a simulator to test it. However, we ran the risk that if our understanding of instruction encoding was wrong, our assembler and our simulator might appear to work correctly when used together, but the result would not run on real AArch64 hardware once it arrived. So, as well as a simulator for the architecture, we verified the internal HotSpot assembler by checking its encoding against GNU as. See /test/hotspot/gtest/aarch64, where a Python program generates source for both the HotSpot internal assembler and GNU as. I strongly suggest you do something similar. (As a matter for the historical record, this did work. The test found several encoding bugs. Once we got the first real AArch64 hardware, the port worked almost immediately.)

Thanks for the description. It would be great to create a similar tool for x86. I tested the encoding manually using the SDE as the authoritative source. It is tedious though and very time consuming.

A subsequent PR in JDK-8329030, perhaps the one that adds encoding support for New Data Destination variants, should include such a tool.

May 03 '24 19:05 steveatgh

Hi @steveatgh ,

I have few more comments.

A) With recent change register only flavors of cvtsi2ss / cvtsi2sd / cvttsd2si/ cvttss2si which are all legacy map 1 instruction and are encoded using REX prefixes at UseAVX=0 will now be promoted to EEVEX which is a fixed 4 byte prefix, we should use REX2 instead. cvtsi2ss_MAP1_with_EEVEX.txt

B) Memory operand flavor of paddd : Missing EVEX tuples for memory operand instructions, it will prevent applying EVEX compressed displacement (disp8*N) encoding optimization. FTR: These are map 1 legacy instruction which could be encoded using SIMD + REX prefix, which adds up to two byte prefix, currently we promote them to VEX encoding in order to zero upper 128 bits, this added additional byte penalty in prefix since it used three byte VEX prefix (c4), now we will encode it using EEVEX if address operands (BASE/INDEX) is a EGPR which will add another byte to prefix since EVEX is a fixed 4 byte prefix. As mentioned above at UseAVX=0 we should encode them using REX2. paddd_MAP1_VEX_now_EEVEX.txt

C) Memory operand flavor of pcmpestri, ptest and vptest.

missing address tuple
legacy mode is true should be false.

Kindly incorporate.

May 11 '24 21:05 jatin-bhateja

A) With recent change register only flavors of cvtsi2ss / cvtsi2sd / cvttsd2si/ cvttss2si which are all legacy map 1 instruction and are encoded using REX prefixes at UseAVX=0 will now be promoted to EEVEX which is a fixed 4 byte prefix, we should use REX2 instead. cvtsi2ss_MAP1_with_EEVEX.txt

SDE ERROR: Illegal instruction at address = 7f3aafa329a0: f3 d5 10 0f 2a c1

I debugged the cause of above assertion failure with UseAVX=0 and found two issues:-

Secondary map ID (0x0f) is being emitted after REX2 prefix, this should not happen since REX.M0 encodes this information.
REX.M0 is not being set, because we are not passing correct value for map1 argument https://github.com/openjdk/jdk/blob/47885cbe8eb215323e9ca8f4a36d422d17521e57/src/hotspot/cpu/x86/assembler_x86.cpp#L11734

If we bring out the 0x0f to the top level assembler routines and remove following newly added codelets we can handle APX even at AVX level 0 and lift AVX512 constraint you introduced since SPECjbb2015 benchmark still explicitly pass UseAVX=0 during measurments.

https://github.com/openjdk/jdk/blob/47885cbe8eb215323e9ca8f4a36d422d17521e57/src/hotspot/cpu/x86/assembler_x86.cpp#L13158 https://github.com/openjdk/jdk/blob/47885cbe8eb215323e9ca8f4a36d422d17521e57/src/hotspot/cpu/x86/assembler_x86.cpp#L13187 https://github.com/openjdk/jdk/blob/47885cbe8eb215323e9ca8f4a36d422d17521e57/src/hotspot/cpu/x86/assembler_x86.cpp#L13347

I am ok to address these limitation in subsequent patches.

May 17 '24 16:05 jatin-bhateja

Hi @vnkozlov,

Is there anything else you would like to see for this PR. If not, I would like to check it in this week.

May 20 '24 14:05 steveatgh

@steveatgh I just saw that we from Oracle did not run any tests for this. Can you please hold off for a day or two until we have the testing completed? I'm sure you did exhaustive testing - but still I'd like to make sure it runs fine on all the x64 machines we have.

May 21 '24 05:05 eme64