8325467: Support methods with many arguments in C2
Issue Summary
If a method has a large number of parameters, we currently bail out from C2 compilation.
Changeset
Allowing C2 compilation of methods with a large number of parameters requires fundamental changes to the register mask data structure, used in many places in C2. In particular, register masks currently have a statically determined size and cannot represent arbitrary numbers of stack slots. This is needed if we want to compile methods with arbitrary numbers of parameters. Register mask operations are present in performance-sensitive parts of C2, which further complicates changes.
Changes:
- Add functionality to dynamically grow/extend register masks. I experimented with a number of design choices to achieve this. To keep the common case (normal number of method parameters) quick and also to avoid more intrusive changes to the current
RegMaskinterface, I decided to leave the "base" statically allocated memory for masks unchanged and only use dynamically allocated memory in the rare cases where it is needed. - Generalize the "chunk"-logic from
PhaseChaitin::Select()to allow arbitrary-sized chunks, and also move most of the logic into register mask methods to separate concerns and to make thePhaseChaitin::Select()code more readable. - Remove all
can_representchecks and bailouts. - Performance tuning. A particularly important change is the early-exit optimization in
RegMask::overlap, used in the performance-sensitive methodPhaseChaitin::interfere_with_live. - Add a new test case
TestManyMethodArguments.javaand extend an old testTestNestedSynchronize.java.
Testing
- GitHub Actions
-
tier1totier4(and additional Oracle-internal testing) on Windows x64, Linux x64, Linux aarch64, macOS x64, and macOS aarch64. - Standard performance benchmarking. No observed conclusive overall performance degradation/improvement.
- Specific benchmarking of C2 compilation time. The changes increase C2 compilation time by, approximately and on average, 1% for methods that could also be compiled before this changeset (see the figure below). The reason for the degradation is further checks required in performance-sensitive code (in particular
PhaseChaitin::remove_bound_register_from_interfering_live_ranges). I have tried optimizing in various ways, but changes I found that lead to improvement also lead to less readable code (and are, in my opinion, not worth it).
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8325467: Support methods with many arguments in C2 (Enhancement - P4)
Reviewers
- Roberto Castañeda Lozano (@robcasloz - Reviewer) 🔄 Re-review required (review applies to b7aa0351)
- Vladimir Kozlov (@vnkozlov - Reviewer) 🔄 Re-review required (review applies to 95396668)
Contributors
- Roberto Castañeda Lozano
<[email protected]>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/20404/head:pull/20404
$ git checkout pull/20404
Update a local copy of the PR:
$ git checkout pull/20404
$ git pull https://git.openjdk.org/jdk.git pull/20404/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 20404
View PR using the GUI difftool:
$ git pr show -t 20404
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/20404.diff
Using Webrev
:wave: Welcome back dlunden! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@dlunde This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8325467: Support methods with many arguments in C2
Co-authored-by: Roberto Castañeda Lozano <[email protected]>
Reviewed-by: rcastanedalo, kvn, epeter
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 88 new commits pushed to the master branch:
- fd30ae988bc512b5d2a5a3fd1bc1ed351af974c7: 8350550: Preload classes from AOT cache during VM bootstrap
- 61acdf6512c6ea3123edb9017ef99d851c917b90: 8365065: cancelled ForkJoinPool tasks no longer throw CancellationException
- f9b91a7836766189e1ccefabdd39d30ad440146b: 8368050: Validation missing in ClassFile signature factories
- ... and 85 more: https://git.openjdk.org/jdk/compare/a49856bb044057a738ffc4186e1e5e3916c0254c...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@dlunde The following label will be automatically applied to this pull request:
-
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 32: Full - Incremental (b61dd25c)
- 31: Full - Incremental (1dd5084f)
- 30: Full (84efc2db)
- 29: Full - Incremental (e165c961)
- 28: Full - Incremental (9b04b5a7)
- 27: Full - Incremental (fe69f5a3)
- 26: Full (c1f41288)
- 25: Full - Incremental (f250a061)
- 24: Full - Incremental (c4a706b5)
- 23: Full (80c6cf47)
- 22: Full - Incremental (78259023)
- 21: Full - Incremental (de5c82db)
- 20: Full - Incremental (7966937f)
- 19: Full (9cefa15f)
- 18: Full - Incremental (d9e4219d)
- 17: Full - Incremental (2df00458)
- 16: Full - Incremental (b7aa0351)
- 15: Full - Incremental (b20d14e5)
- 14: Full - Incremental (74357621)
- 13: Full - Incremental (c41a76b9)
- 12: Full (76f6b8f8)
- 11: Full - Incremental (5be718e8)
- 10: Full - Incremental (fbfddb29)
- 09: Full - Incremental (e370f61f)
- 08: Full - Incremental (0a5f5c84)
- 07: Full - Incremental (76d182e4)
- 06: Full (a54d8a37)
- 05: Full - Incremental (1bec1692)
- 04: Full - Incremental (36f9dabf)
- 03: Full - Incremental (95396668)
- 02: Full - Incremental (4e7f4dbf)
- 01: Full - Incremental (cbb2c251)
- 00: Full (a0589873)
What is the new maximum RegMask size and OptoReg value? What happens when C2 hits that limit? It's not clear from looking at your tests if they are under the limit or test going past the limit. There is at least one hidden limit that would cause an overflow if we accidentally go past it: OptoRegPair reg values currently must fit in short.
Thanks for the comments @dean-long. There is no limit on register mask size, besides that it has to fit in memory. I allocate extended register mask memory in comp_arena, but I realize I should add an explicit check to see if the allocation succeeds or not when growing the register mask (and bail out if it doesn't).
Good catch on the implications of short in OptoRegPair (and int in OptoReg). In practice, I doubt we'll ever reach these limits, but we should still ensure we add checks for this. Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency.
but I realize I should add an explicit check to see if the allocation succeeds or not when growing the register mask (and bail out if it doesn't)
After a bit of investigation, I do not believe we actually need explicit checks for if the allocation succeeds. This is already handled internally in the arena allocation and the VM will crash if we run out of memory. This should never really happen in practice, but we could set some sanity limit on register mask size if we feel that's appropriate.
Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency.
I don't know the reason for the inconsistency. I agree they should use the same type. I suggest using OptoReg::Name in OptoRegPair, and changing its type to short, along with checks that we don't overflow the smaller type.
Do you know why we use int in OptoReg but short in OptoRegPair? I don't see why we should not change it to short in OptoReg as well for consistency.
I don't know the reason for the inconsistency. I agree they should use the same type. I suggest using OptoReg::Name in OptoRegPair, and changing its type to short, along with checks that we don't overflow the smaller type.
Platform specific code use int. Converting OptoReg::Name may need more changes then you think. I agree with changing OptoReg::Name type to short but it should be separate from this RFE.
short in OptoRegPair is for memory saving. In a lot of places (all?) we use copy value for it.
A separate RFE is fine with me. I'm just concerned that unlimited size regmasks could allow OptoRegPair to overflow. Does C2 have a maximum frame size?
Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way?
@dean-long
Does C2 have a maximum frame size?
I'll investigate, unless someone already has the answer.
Don't we still need RegMask::can_represent() checks to make sure regmask size doesn't exceed what OptoReg and OptoRegPair can represent? Otherwise we will need checks every time we create a new OptoReg or OptoRegPair, or is there a better way?
Yes, we should add checks somewhere. My initial thought was to have them whenever growing or offsetting register masks (operations that change what a register mask can represent), and bail out upon failing a check. But, then we may have bailouts whenever, e.g., inserting into register masks, which is probably difficult to handle.
Also, even before my changes, OptoRegs may overflow in PhaseChaitin::Select because, even though register masks have a limit, there is no limit on how much chunk can grow. That is, one could also argue that OptoReg overflow is a separate issue.
I will have to think a bit more about this.
Thanks to @robcasloz for the comments and contributions!
/contributor add @robcasloz
@dlunde
Contributor Roberto Castañeda Lozano <[email protected]> successfully added.
Would it be possible to 1) extend test/hotspot/gtest/opto/test_regmask.cpp with tests that exercise extended RegMasks and 2) re-run the standard test tiers with a (temporary) RM_SIZE value that is low enough to also exercise the new logic more often?
Yes, good idea. I'm adding 2) to my TODO list (and as we discussed in-person, you offered to work a bit on 1) yourself).
I think RegMask::is_UP() needs to be updated to handle extended RegMasks (unless we can prove that it will never be called on an extended one).
Thanks, good catch. The problem is that overlap (intentionally) does not consider the all-stack flag. I'll experiment by adding some asserts to see if this is also a more widespread problem. For is_UP, I think the best solution is to rewrite it using RegMask::find_last_elem and OptoReg::is_reg instead.
Other uses of Matcher::STACK_ONLY_mask might also need to be revisited, in particular the line tmp_rm.SUBTRACT(Matcher::STACK_ONLY_mask) in PhaseChaitin::Split.
Yes, I'll have a look. SUBTRACT should be fine though, I have rewritten it to handle such cases.
(and as we discussed in-person, you offered to work a bit on 1) yourself).
Yes, here is an extension of the current test set: https://github.com/openjdk/jdk/commit/4fc5282e24a8449107a66f534a794b8e680537f7, feel free to merge into this PR, and of course make any change you find necessary. The patch introduces new tests for extended RegMasks but also for basic RegMask operations that were not covered. I have also added a few tests that cover working with offsets (rollovers, insertions, deletions, find operations, and different combinations of SUBTRACT_inner). Finally, it introduces a minimal debug-only extension of RegMask which I found necessary to write the tests. Beware that I have only run the new tests locally (linux-x64), so please check that they work on other platforms.
After doing some simple performance testing, it looks like the best option may, after all, be to simply use bigger static register masks. Please hold off a bit with reviewing! I'm running some more rigorous tests for more architectures over the weekend to make sure there are no caveats.
I explored the possibility of calculating a static upper bound for register mask size given that parameters must fit in 255 32-bit words (from the JVM spec). My conclusion is that the required increase in static register mask size is too expensive, and that we should proceed with my current solution.
Details In C2, register masks must be capable of representing incoming arguments (up to 255) as well as the maximum number of outgoing arguments among all function calls in the compiled method (also up to 255). Additionally, arguments are 64-bit aligned in the frame, which doubles the number of required register mask bits.
Example on my x64 machine (1 word = 32 bits here):
- Registers require at minimum 18 words in register masks.
- We currently add 4 words for representing arguments, locks, and some other stack locations, so in total 22 words.
- To ensure we can fit 255 incoming arguments in the mask, we need 255 * 2 bits = 16 words.
- To ensure we can fit 255 outgoing arguments in the mask, we need 255 * 2 bits = 16 words.
That is, we go from 18 + 4 = 22 words to at least 18 + 16 + 16 = 50 words, only taking incoming and outgoing arguments into account. Performance experiments indicate a 3.3% C2 compilation speed degradation (compared to 1% for the solution in this PR).
I agree that we should use your approach instead of big static size.
As we discussed on our previous meeting Aarch64 has very small registers mask - only 10 words. Can you look if that enough or we should increase static size of it? It could be separate RFE.
As we discussed on our previous meeting Aarch64 has very small registers mask - only 10 words. Can you look if that enough or we should increase static size of it? It could be separate RFE.
I do have looking at this in my to-do list (as a separate RFE). I'm not sure it is an issue though: the calculation of RM_SIZE first ensures that it covers all registers, and then adds three words to cover arguments, locks, and some other things. If it is only 10 words in total on aarch64, it should be because we simply do not have as many registers that we need to refer to. I do not recall from our discussion, is there some particular case where RM_SIZE on aarch64 is an issue?
is there some particular case where RM_SIZE on aarch64 is an issue?
Both register_aarch64.hpp and register_x86.hpp (64-bits) specify number_of_registers = 32. So why RM_SIZE is different?
Both register_aarch64.hpp and register_x86.hpp (64-bits) specify number_of_registers = 32. So why RM_SIZE is different?
The RM_SIZE calculation is based on RegisterForm::_reg_ctr which (I think) is incremented during parsing of the ad-files. As far as I can tell, number_of_registers does not influence this calculation. I can investigate more later on (focusing on updates for this PR now).
Thanks for the review @vnkozlov!
I just pushed an updated version. @dean-long: I believe the update addresses your concerns.
Summary of the most important changes:
-
Add a (generous) limit to the number of stack slots that
BoxLockNodes can occupy in total. If we reach the limit, we bail out of compilation. -
Add an upper bound for register mask growth. The upper bound is is (or, should be) impossible to reach, and I've added an
assertto check this whenever a register mask grows.// Compute a best-effort (statically known) upper bound for register mask // size in 32-bit words. When extending/growing register masks, we should // never grow past this size. static const unsigned int RM_SIZE_MAX = (((RM_SIZE_MIN << 5) + // Slots for machine registers (max_method_parameter_length * 2) + // Slots for incoming arguments (max_method_parameter_length * 2) + // Slots for outgoing arguments BoxLockNode_slot_limit + // Slots for locks 64 // Padding, reserved words, etc. ) + 31) >> 5; // Number of bits -> number of 32-bit words -
Add a
STATIC_ASSERTthatshortcan index the maximum size register mask. -
Add a compilation bailout to
PhaseChaitin::Selectat register mask chunk rollover, ifshortcan no longer index the rolled-over mask. -
Require that the user provides the arena in which to extend register masks directly in register mask constructors. This is slightly more verbose compared to using
_comp_arenaby default, but safer and more flexible. -
Optimize memory consumption by allocating extensions of temporary register masks in a separate resource area (not
_comp_arena). -
Add @robcasloz's register mask tests.
-
Improve register mask dumping functionality (I needed it for debugging, could go in a separate RFE).
-
Make
overlapalso take theall-stackflag into account (bug). -
Rename
TestManyMethodArguments.javatoTestMaxMethodArguments.java. It now tests that C2 can compile the maximum allowed number of arguments (according to the JVM spec).
Add a STATIC_ASSERT that short can index the maximum size register mask.
This assumes the types in OptoRegPair won't change. Something more future-proof would be to try to construct an OptoRegPair with RM_SIZE_MAX >> 5, then try to read it back. We should probably have OptoRegPair check input values for overflow while we are at it.
Add an upper bound for register mask growth
Does this mean "AllStack" no longer means infinite?
This assumes the types in OptoRegPair won't change. Something more future-proof would be to try to construct an OptoRegPair with RM_SIZE_MAX >> 5, then try to read it back. We should probably have OptoRegPair check input values for overflow while we are at it.
Right, it's better to not hardcode it to short. I'll fix it.
Does this mean "AllStack" no longer means infinite?
"AllStack" still works the same way as before; the growth cap is only for the actual representable bits in the register mask. When rolling over register masks in PhaseChaitin::Select (the main use of the all-stack flag), I have added a sanity bailout if short cannot index the rolled-over mask. I'll change that check to use OptoRegPair instead as well.
I just pushed another update.
Change summary:
- Merge with jdk-24+16
- Add assert to new rare bailout in
PhaseChaitin::Selectthat should not happen in practice - Refactor
OptoRegPairslightly and add asserts checking for overflow (@dean-long). - Improve the static asserts in
regmask.hpp(@dean-long) - Performance optimization: add a function to trim watermarks in
regmask.hppand apply it in theSUBTRACT*methods. - Update some tests related to dynamically generating and testing methods with big arities. After this changeset, and in combination with
-Xcomp, we compile a lot of methods in these tests that we previously bailed out on. I had to adjust some limits to allow bailing out in the register allocator when things get out of hand. It seems like this is only a problem in these very specific (and artificial) tests, but if it surfaces elsewhere in the future we may need to reevaluate the bailouts we have during register allocation.
@dlunde This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
New comment to avoid automatic closing.
@dlunde This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!