jdk 8325467: Support methods with many arguments in C2

Issue Summary

If a method has a large number of parameters, we currently bail out from C2 compilation.

Changeset

Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.

Changes:

Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current RegMask interface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed.
Generalize the "chunk"-logic from PhaseChaitin::Select() to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make the PhaseChaitin::Select() code more readable.
Remove all can_represent checks and bailouts.
Performance tuning. A particularly important change is the early-exit optimization in RegMask::overlap, used in the performance-sensitive method PhaseChaitin::interfere_with_live.
Add a new test case TestManyMethodArguments.java and extend an old test TestNestedSynchronize.java.

Testing

GitHub Actions
tier1 to tier4 (and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64.
Standard performance benchmarking. No observed conclusive overall performance degradation/improvement.
Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular PhaseChaitin::remove_bound_register_from_interfering_live_ranges). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it).

c2-regression

Progress

[ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
[x] Change must not contain extraneous whitespace
[x] Commit message must refer to an issue

Issue

JDK-8325467: Support methods with many arguments in C2 (Enhancement - P4)

Reviewers

Roberto Castañeda Lozano (@robcasloz - Reviewer) 🔄 Re-review required (review applies to b7aa0351)
Vladimir Kozlov (@vnkozlov - Reviewer) 🔄 Re-review required (review applies to 95396668)

Contributors

Roberto Castañeda Lozano <[email protected]>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404
$ git checkout pull/20404

Update a local copy of the PR:
$ git checkout pull/20404
$ git pull https://git.openjdk.org/jdk.git pull/20404/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 20404

View PR using the GUI difftool:
$ git pr show -t 20404

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20404.diff

Using Webrev

Link to Webrev Comment

Jul 31 '24 12:07 dlunde

:wave: Welcome back dlunden! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

Jul 31 '24 12:07 bridgekeeper[bot]

@dlunde This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8325467: Support methods with many arguments in C2

Co-authored-by: Roberto Castañeda Lozano <[email protected]>
Reviewed-by: rcastanedalo, kvn, epeter

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 88 new commits pushed to the master branch:

fd30ae988bc512b5d2a5a3fd1bc1ed351af974c7: 8350550: Preload classes from AOT cache during VM bootstrap
61acdf6512c6ea3123edb9017ef99d851c917b90: 8365065: cancelled ForkJoinPool tasks no longer throw CancellationException
f9b91a7836766189e1ccefabdd39d30ad440146b: 8368050: Validation missing in ClassFile signature factories
... and 85 more: https://git.openjdk.org/jdk/compare/a49856bb044057a738ffc4186e1e5e3916c0254c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

Jul 31 '24 12:07 openjdk[bot]

@dlunde The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

Jul 31 '24 12:07 openjdk[bot]

Webrevs

Aug 07 '24 07:08 mlbridge[bot]

What is the new maximum RegMask size and OptoReg value? What happens when C2 hits that limit? It's not clear from looking at your tests if they are under the limit or test going past the limit. There is at least one hidden limit that would cause an overflow if we accidentally go past it: OptoRegPair reg values currently must fit in short.

Aug 07 '24 08:08 dean-long

Thanks for the comments @dean-long. There is no limit on register mask size, besides that it has to fit in memory. I allocate extended register mask memory in comp_arena, but I realize I should add an explicit check to see if the allocation succeeds or not when growing the register mask (and bail out if it doesn't).

Good catch on the implications of short in OptoRegPair (and int in OptoReg). In practice, I doubt we'll ever reach these limits, but we should still ensure we add checks for this. Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency.

Aug 07 '24 10:08 dlunde

but I realize I should add an explicit check to see if the allocation succeeds or not when growing the register mask (and bail out if it doesn't)

After a bit of investigation, I do not believe we actually need explicit checks for if the allocation succeeds. This is already handled internally in the arena allocation and the VM will crash if we run out of memory. This should never really happen in practice, but we could set some sanity limit on register mask size if we feel that's appropriate.

Aug 07 '24 11:08 dlunde

Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency.

I don't know the reason for the inconsistency. I agree they should use the same type. I suggest using OptoReg::Name in OptoRegPair, and changing its type to short, along with checks that we don't overflow the smaller type.

Aug 08 '24 19:08 dean-long

Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency.

I don't know the reason for the inconsistency. I agree they should use the same type. I suggest using OptoReg::Name in OptoRegPair, and changing its type to short, along with checks that we don't overflow the smaller type.

Platform specific code use int. Converting OptoReg::Name may need more changes then you think. I agree with changing OptoReg::Name type to short but it should be separate from this RFE. short in OptoRegPair is for memory saving. In a lot of places (all?) we use copy value for it.

Aug 09 '24 03:08 vnkozlov

A separate RFE is fine with me. I'm just concerned that unlimited size regmasks could allow OptoRegPair to overflow. Does C2 have a maximum frame size?

Aug 09 '24 07:08 dean-long

Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way?

Aug 12 '24 20:08 dean-long

@dean-long

Does C2 have a maximum frame size?

I'll investigate, unless someone already has the answer.

Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way?

Yes, we should add checks somewhere. My initial thought was to have them whenever growing or offsetting register masks (operations that change what a register mask can represent), and bail out upon failing a check. But, then we may have bailouts whenever, e.g., inserting into register masks, which is probably difficult to handle.

Also, even before my changes, OptoRegs may overflow in PhaseChaitin::Select because, even though register masks have a limit, there is no limit on how much chunk can grow. That is, one could also argue that OptoReg overflow is a separate issue.

I will have to think a bit more about this.

Aug 13 '24 10:08 dlunde

Does C2 have a maximum frame size?

No, but there is arbitrary 1M limit check: chaitin.cpp#L631

Aug 13 '24 18:08 vnkozlov

Thanks to @robcasloz for the comments and contributions!

/contributor add @robcasloz

Aug 20 '24 14:08 dlunde

@dlunde Contributor Roberto Castañeda Lozano <[email protected]> successfully added.

Aug 20 '24 14:08 openjdk[bot]

Would it be possible to 1) extend test/hotspot/gtest/opto/test_regmask.cpp with tests that exercise extended RegMasks and 2) re-run the standard test tiers with a (temporary) RM_SIZE value that is low enough to also exercise the new logic more often?

Yes, good idea. I'm adding 2) to my TODO list (and as we discussed in-person, you offered to work a bit on 1) yourself).

I think RegMask::is_UP() needs to be updated to handle extended RegMasks (unless we can prove that it will never be called on an extended one).

Thanks, good catch. The problem is that overlap (intentionally) does not consider the all-stack flag. I'll experiment by adding some asserts to see if this is also a more widespread problem. For is_UP, I think the best solution is to rewrite it using RegMask::find_last_elem and OptoReg::is_reg instead.

Other uses of Matcher::STACK_ONLY_mask might also need to be revisited, in particular the line tmp_rm.SUBTRACT(Matcher::STACK_ONLY_mask) in PhaseChaitin::Split.

Yes, I'll have a look. SUBTRACT should be fine though, I have rewritten it to handle such cases.

Aug 23 '24 11:08 dlunde

(and as we discussed in-person, you offered to work a bit on 1) yourself).

Yes, here is an extension of the current test set: https://github.com/openjdk/jdk/commit/4fc5282e24a8449107a66f534a794b8e680537f7, feel free to merge into this PR, and of course make any change you find necessary. The patch introduces new tests for extended RegMasks but also for basic RegMask operations that were not covered. I have also added a few tests that cover working with offsets (rollovers, insertions, deletions, find operations, and different combinations of SUBTRACT_inner). Finally, it introduces a minimal debug-only extension of RegMask which I found necessary to write the tests. Beware that I have only run the new tests locally (linux-x64), so please check that they work on other platforms.

Aug 23 '24 14:08 robcasloz

After doing some simple performance testing, it looks like the best option may, after all, be to simply use bigger static register masks. Please hold off a bit with reviewing! I'm running some more rigorous tests for more architectures over the weekend to make sure there are no caveats.

Aug 23 '24 17:08 dlunde

I explored the possibility of calculating a static upper bound for register mask size given that parameters must fit in 255 32-bit words (from the JVM spec). My conclusion is that the required increase in static register mask size is too expensive, and that we should proceed with my current solution.

Details In C2, register masks must be capable of representing incoming arguments (up to 255) as well as the maximum number of outgoing arguments among all function calls in the compiled method (also up to 255). Additionally, arguments are 64-bit aligned in the frame, which doubles the number of required register mask bits.

Example on my x64 machine (1 word = 32 bits here):

Registers require at minimum 18 words in register masks.
We currently add 4 words for representing arguments, locks, and some other stack locations, so in total 22 words.
To ensure we can fit 255 incoming arguments in the mask, we need 255 * 2 bits = 16 words.
To ensure we can fit 255 outgoing arguments in the mask, we need 255 * 2 bits = 16 words.

That is, we go from 18 + 4 = 22 words to at least 18 + 16 + 16 = 50 words, only taking incoming and outgoing arguments into account. Performance experiments indicate a 3.3% C2 compilation speed degradation (compared to 1% for the solution in this PR).

Aug 27 '24 15:08 dlunde

I agree that we should use your approach instead of big static size.

As we discussed on our previous meeting Aarch64 has very small registers mask - only 10 words. Can you look if that enough or we should increase static size of it? It could be separate RFE.

Sep 03 '24 18:09 vnkozlov

As we discussed on our previous meeting Aarch64 has very small registers mask - only 10 words. Can you look if that enough or we should increase static size of it? It could be separate RFE.

I do have looking at this in my to-do list (as a separate RFE). I'm not sure it is an issue though: the calculation of RM_SIZE first ensures that it covers all registers, and then adds three words to cover arguments, locks, and some other things. If it is only 10 words in total on aarch64, it should be because we simply do not have as many registers that we need to refer to. I do not recall from our discussion, is there some particular case where RM_SIZE on aarch64 is an issue?

Sep 04 '24 09:09 dlunde

is there some particular case where RM_SIZE on aarch64 is an issue?

Both register_aarch64.hpp and register_x86.hpp (64-bits) specify number_of_registers = 32. So why RM_SIZE is different?

Sep 04 '24 16:09 vnkozlov

Both register_aarch64.hpp and register_x86.hpp (64-bits) specify number_of_registers = 32. So why RM_SIZE is different?

The RM_SIZE calculation is based on RegisterForm::_reg_ctr which (I think) is incremented during parsing of the ad-files. As far as I can tell, number_of_registers does not influence this calculation. I can investigate more later on (focusing on updates for this PR now).

Sep 04 '24 16:09 dlunde

Thanks for the review @vnkozlov!

I just pushed an updated version. @dean-long: I believe the update addresses your concerns.

Summary of the most important changes:

Add a (generous) limit to the number of stack slots that BoxLockNodes can occupy in total. If we reach the limit, we bail out of compilation.

Add an upper bound for register mask growth. The upper bound is is (or, should be) impossible to reach, and I've added an assert to check this whenever a register mask grows.

    // Compute a best-effort (statically known) upper bound for register mask
    // size in 32-bit words. When extending/growing register masks, we should
    // never grow past this size.
    static const unsigned int RM_SIZE_MAX =
        (((RM_SIZE_MIN << 5) +                // Slots for machine registers
          (max_method_parameter_length * 2) + // Slots for incoming arguments
          (max_method_parameter_length * 2) + // Slots for outgoing arguments
          BoxLockNode_slot_limit +            // Slots for locks
          64                                  // Padding, reserved words, etc.
          ) + 31) >> 5; // Number of bits -> number of 32-bit words

Add a STATIC_ASSERT that short can index the maximum size register mask.
Add a compilation bailout to PhaseChaitin::Select at register mask chunk rollover, if short can no longer index the rolled-over mask.
Require that the user provides the arena in which to extend register masks directly in register mask constructors. This is slightly more verbose compared to using _comp_arena by default, but safer and more flexible.
Optimize memory consumption by allocating extensions of temporary register masks in a separate resource area (not _comp_arena).
Add @robcasloz's register mask tests.
Improve register mask dumping functionality (I needed it for debugging, could go in a separate RFE).
Make overlap also take the all-stack flag into account (bug).
Rename TestManyMethodArguments.java to TestMaxMethodArguments.java. It now tests that C2 can compile the maximum allowed number of arguments (according to the JVM spec).

Sep 09 '24 12:09 dlunde

Add a STATIC_ASSERT that short can index the maximum size register mask.

This assumes the types in OptoRegPair won't change. Something more future-proof would be to try to construct an OptoRegPair with RM_SIZE_MAX >> 5, then try to read it back. We should probably have OptoRegPair check input values for overflow while we are at it.

Add an upper bound for register mask growth

Does this mean "AllStack" no longer means infinite?

Sep 13 '24 00:09 dean-long

This assumes the types in OptoRegPair won't change. Something more future-proof would be to try to construct an OptoRegPair with RM_SIZE_MAX >> 5, then try to read it back. We should probably have OptoRegPair check input values for overflow while we are at it.

Right, it's better to not hardcode it to short. I'll fix it.

Does this mean "AllStack" no longer means infinite?

"AllStack" still works the same way as before; the growth cap is only for the actual representable bits in the register mask. When rolling over register masks in PhaseChaitin::Select (the main use of the all-stack flag), I have added a sanity bailout if short cannot index the rolled-over mask. I'll change that check to use OptoRegPair instead as well.

Sep 13 '24 08:09 dlunde

I just pushed another update.

Change summary:

Merge with jdk-24+16
Add assert to new rare bailout in PhaseChaitin::Select that should not happen in practice
Refactor OptoRegPair slightly and add asserts checking for overflow (@dean-long).
Improve the static asserts in regmask.hpp (@dean-long)
Performance optimization: add a function to trim watermarks in regmask.hpp and apply it in the SUBTRACT* methods.
Update some tests related to dynamically generating and testing methods with big arities. After this changeset, and in combination with -Xcomp, we compile a lot of methods in these tests that we previously bailed out on. I had to adjust some limits to allow bailing out in the register allocator when things get out of hand. It seems like this is only a problem in these very specific (and artificial) tests, but if it surfaces elsewhere in the future we may need to reevaluate the bailouts we have during register allocation.

Sep 26 '24 14:09 dlunde

@dlunde This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Oct 25 '24 11:10 bridgekeeper[bot]

New comment to avoid automatic closing.

Oct 28 '24 10:10 dlunde

@dlunde This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Nov 25 '24 13:11 bridgekeeper[bot]