8332455: Improve G1/ParallelGC tasks to not override array lengths
To avoid causing excessive traffic on the task queues when scanning large object arrays, G1 and ParallelGC slice those arrays into smaller pieces. The current implementation overrides the from-space array's length field to track the array slices.
I think it would be cleaner if we improved the tasks such that array slices can be fully encoded in the task itself and do not require overriding the array length.
This PR borrows the principal encoding and slicing algorithm from Shenandoah (originally written by @shipilev). It also unifies the slicing implementations of G1 young GC, G1 concurrent marking, and ParallelGC.
On x86 (32-bit), we don't have enough bits in the single-word task to encode the slicing, so I'm extending the task to 64 bits (a pointer and two int32 fields).
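For illustration, here is a minimal sketch of the kind of single-word encoding this borrows. Field widths and names are made up for this example and do not match the actual Shenandoah/HotSpot code:

```cpp
// Hypothetical sketch of encoding an array-scanning task in one 64-bit
// word, in the spirit of Shenandoah's chunked array tasks. Assumes heap
// addresses fit in 48 bits; "chunk" is the slice index and "pow" the
// log2 of the slice size. Not the actual HotSpot code.
#include <cassert>
#include <cstdint>

class SlicedArrayTask {
  static const int OOP_BITS   = 48;                       // low bits: object address
  static const int POW_BITS   = 5;                        // log2 of slice size
  static const int CHUNK_BITS = 64 - OOP_BITS - POW_BITS; // slice index

  uint64_t _value;

public:
  SlicedArrayTask(void* obj, uint64_t chunk, uint64_t pow) {
    assert((((uint64_t)obj >> OOP_BITS) == 0) && "address must fit in 48 bits");
    assert((chunk < (uint64_t(1) << CHUNK_BITS)) && "chunk must fit");
    assert((pow   < (uint64_t(1) << POW_BITS))   && "pow must fit");
    _value = (uint64_t)obj
           | (chunk << OOP_BITS)
           | (pow   << (OOP_BITS + CHUNK_BITS));
  }

  void*    obj()   const { return (void*)(_value & ((uint64_t(1) << OOP_BITS) - 1)); }
  uint64_t chunk() const { return (_value >> OOP_BITS) & ((uint64_t(1) << CHUNK_BITS) - 1); }
  uint64_t pow()   const { return _value >> (OOP_BITS + CHUNK_BITS); }
};
```

On 32-bit targets there is no room for such packing, hence the wider fallback task layout mentioned above.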
I put in some effort to make sure the shared array slicing honors the user-configurable flags ParGCArrayScanChunk and ObjArrayMarkingStride just as before, but TBH, I don't see the point of having those as product flags to begin with. I would probably deprecate and remove ParGCArrayScanChunk and use the develop flag ObjArrayMarkingStride everywhere. YMMV.
Testing:
- [x] hotspot_gc
- [x] tier1
- [x] tier2
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8332455: Improve G1/ParallelGC tasks to not override array lengths (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19282/head:pull/19282
$ git checkout pull/19282
Update a local copy of the PR:
$ git checkout pull/19282
$ git pull https://git.openjdk.org/jdk.git pull/19282/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19282
View PR using the GUI difftool:
$ git pr show -t 19282
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19282.diff
:wave: Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@rkennke The following label will be automatically applied to this pull request:
- hotspot-gc
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
@rkennke this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8332455
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
⚠️ @rkennke This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch>, where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
Besides the problems discussed below, I also wondered about the rationale for this change. The JBS issue (incorrectly, see below) describes what G1 currently does, but doesn't say anything about why it's a problem. I see this issue has the lilliput label, so maybe it has something to do with that project? Without that information it's hard to discuss any alternatives.
Correct, and sorry about that. Indeed, my motivation to explore alternatives comes out of the Lilliput project. I am currently working on 'Lilliput2', that is, 4-byte object headers (and I've already got a version that runs quite well). While the header (aka mark-word) is only 4 bytes, many GCs still use the whole first word of the object to store the forwarding pointer. This is normally not a problem - the object has already been copied and we don't care much about anything in the from-space copy. However, arrays would now have their length field in the first word, too. Thus overriding the first word of the from-space copy with the forwarding pointer would destroy any information that the GC keeps there (or the other way around).
I knew that in Shenandoah we don't do that, and instead encode the slicing in the tasks (and I heard that ZGC does something similar, but I haven't looked at that yet), so my goal was to explore whether that is feasible for G1, too. At first I only did G1 young STW GC, but digging deeper I realized that G1 concurrent marking and Parallel young GC also use similar approaches (without sharing much code). So I also explored whether the encoding would work there, too. This is pretty much how I ended up with this PR.
About the address space concern: yes, I am aware of this. We have been doing the exact same slicing in Shenandoah since forever, and so far never had a customer who got to that limit. Our reasoning there was that if one uses this amount of Java heap, using 2x as much native memory for the GC task queues (with the fallback version of the task encoding) would probably not matter all that much.
But I'll be totally happy to explore other approaches. In fact, I think at least G1 concurrent mark should be fine, anyway.
The bug report says G1 array slicing uses the from-space array's length field to track the array slices. That isn't true. It uses the to-space array's length field, since it uses PartialArrayTaskStepper.
Right. I got that wrong. However, it still requires the from-space length to remain intact, which is not the case with Lilliput2.
ParallelGC uses the from-space length in PSPromotionManager::process_array_chunk, though there is an RFE to change ParallelGC to also use PartialArrayTaskStepper (JDK-8311163).
Ok, that is good. If we could come up with an approach that makes PartialArrayTaskStepper not depend on the from-space length, that would solve my problem.
I also wish the proposed change was broken up a little bit. Changing all of G1 STW, G1 Concurrent, and Parallel at the same time is hard to review, and I don't think there's a need to do all in one change.
I could do that. But let's first see if we come up with an approach that makes us all happy ;-)
One idea that I just now had (for my particular problem in Lilliput2) is this: we only need the length field in the from-space copy to keep the actual length of the array, so that we know when to stop, and also to restore the length in the to-space copy. We could copy the length to from-space offset 8 (instead of offset 4, which is the length field with Lilliput2) and use that instead. That should be ok, because surely we don't need to do the slicing for 0-length arrays (in which case we would stomp over the next object), and for larger arrays we don't care about the from-space elements, just like we don't care about the length.
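A minimal sketch of that idea, with made-up offsets and helper names (this is not actual Lilliput2 code); it assumes a 4-byte header at offset 0 and the array length at offset 4:

```cpp
// Hypothetical illustration: before the first word of a from-space array
// is overwritten with a forwarding pointer, stash the length in the
// second word (offset 8). Offsets and names are assumptions for this
// sketch, not the real Lilliput2 layout.
#include <cstddef>
#include <cstdint>
#include <cstring>

static const size_t LENGTH_OFFSET    = 4; // length field with a 4-byte header (assumed)
static const size_t PRESERVED_OFFSET = 8; // second word of the object

inline void preserve_array_length(uint8_t* from_space_array) {
  int32_t len;
  std::memcpy(&len, from_space_array + LENGTH_OFFSET, sizeof(len));
  // Safe only for arrays that actually get sliced: a 0-length array is
  // just 8 bytes, so writing at offset 8 would stomp the next object -
  // but 0-length arrays are never sliced in the first place.
  std::memcpy(from_space_array + PRESERVED_OFFSET, &len, sizeof(len));
}

inline int32_t preserved_array_length(const uint8_t* from_space_array) {
  int32_t len;
  std::memcpy(&len, from_space_array + PRESERVED_OFFSET, sizeof(len));
  return len;
}
```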
I have a couple of ideas for alternative mechanisms that I might explore.
Would you mind sharing the ideas?
Thanks, Roman
I'll respond to other points separately.
Here's a (very drafty) suggestion. https://github.com/openjdk/jdk/compare/master...kimbarrett:openjdk-jdk:array-splitting?expand=1
(There are lots of specific details about these changes that I don't like, and would want to do a bit differently in a real PR, but it demonstrates the approach, and has passed basic testing. Note that it requires disabling ParallelGC for now to build, e.g. --disable-jvm-feature-parallelgc. There are lots of comments that haven't been updated to reflect associated code changes.)
The idea is to define a C++ object that represents the state of the iteration over an array - at a minimum, the new array and the next array index to start processing from. (We might have other members for convenience, or for use by variants other than the current usage by G1.) The associated ScannerTask entries in the taskqueue contain a pointer to the state object associated with an array. We allocate the state object when we decide to parallelize the scanning of the array.
Use arena allocation for the state objects, so we don't have to search the taskqueues to reclaim them if we need to discard the contents of the queues (such as when we sometimes abort G1 concurrent marking).
Put a free-list allocator in front of the arena allocator, so the high water mark for allocation is based on the number of in-progress arrays rather than the total number that exist.
The states are reference counted, so we know when to return them to the free-list. Each referring ScannerTask we insert in a taskqueue increases the count. When we take one out and have finished processing it, we decrement the count. Release the state when its count reaches zero. Manipulation of tasks inside a task queue doesn't need to manipulate the count: while a ScannerTask might get copied around within the queues, that mechanism ensures the number of copies ultimately taken out for processing matches the number inserted.
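A rough sketch of that scheme, with hypothetical names (the linked draft differs in its details):

```cpp
// Hypothetical sketch of the per-array iteration state described above.
// Taskqueue entries would hold a pointer to one of these; the reference
// count tracks how many queued tasks still refer to it. Names are made
// up for this sketch, not taken from the linked draft.
#include <atomic>
#include <cstddef>

class PartialArrayState {
  void*               _to_array;   // the array being scanned
  std::atomic<size_t> _next_index; // next element index to claim
  std::atomic<int>    _refcount;   // queued tasks referring to this state

public:
  PartialArrayState(void* to_array, size_t start_index)
    : _to_array(to_array), _next_index(start_index), _refcount(0) {}

  // Claim the next slice of 'len' elements; returns its start index.
  size_t claim(size_t len) {
    return _next_index.fetch_add(len, std::memory_order_relaxed);
  }

  // Called for each referring ScannerTask pushed onto a taskqueue.
  void add_ref() { _refcount.fetch_add(1, std::memory_order_relaxed); }

  // Called after a popped task has been fully processed; returns true
  // when the last reference is gone and the state can go back to the
  // free list in front of the arena.
  bool release() {
    return _refcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

  void* to_array() const { return _to_array; }
};
```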
A different idea is to have segregated queues for oops and partial array tasks, as is already done in some cases by some collectors. That complicates processing, stealing, and termination testing somewhat, but avoids the more complicated allocation scheme for the state. One could even consider segregated queues for oop* and narrowOop*, avoiding the whole task type dispatching. I've long wondered about the tradeoffs there.
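For concreteness, a toy sketch of the segregated-queue shape (hypothetical types; real GC taskqueues are lock-free and support stealing, which std::deque here does not model):

```cpp
// Toy illustration of segregated per-worker queues: one queue for plain
// oops and one for partial-array tasks, so no type bits are needed in
// the task itself. std::deque is only used to show the shape.
#include <deque>

struct PartialArrayTask {
  void* state; // would point at a PartialArrayState or similar
};

struct WorkerQueues {
  std::deque<void*>            oop_queue;   // plain object references
  std::deque<PartialArrayTask> array_queue; // in-progress array slices

  // Pop a local task, preferring array slices so in-progress arrays
  // finish promptly (a policy choice made up for this sketch).
  bool pop_local(void*& obj, PartialArrayTask& task, bool& is_array) {
    if (!array_queue.empty()) {
      task = array_queue.back();
      array_queue.pop_back();
      is_array = true;
      return true;
    }
    if (!oop_queue.empty()) {
      obj = oop_queue.back();
      oop_queue.pop_back();
      is_array = false;
      return true;
    }
    return false;
  }
};
```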
> One idea that I just now had (for my particular problem in Lilliput2) is this: [...] We could copy the length to from-space offset 8 (instead of offset 4, which is the length field with Lilliput2) and use that instead.
That still doesn't permit parallelizing the processing in the evac-failure case.
> About the address space concern: yes, I am aware of this. We have been doing the exact same slicing in Shenandoah since forever, and so far never had a customer who got to that limit. [...]
I would guess there aren't tons of users of Linux's large virtual address space.
For fat tasks, I'm less concerned about the actual space impact than about the additional memory transfer costs for the normal and dominant non-array case, where we end up spending cycles copying around these completely unused words.
I'd much rather have allocated states or segregated queues than fat tasks.
Withdrawing this PR - I found a better solution to my problem, so there is no need to change the array slicing in such a fundamental way.
Care to share?
Yeah, it's this commit: https://github.com/rkennke/lilliput/commit/18574c60532b14701fbcecdf24822a7fe23f760d
The idea is simply to not use the array length in from-space for array slicing (because it would clash with the forwarding pointer), but instead use the second word and preserve the original array length there. It is kinda Lilliput(2)-specific, so I guess there's no reason to upstream it ahead of time. It's also still kinda quick-and-dirty. ParallelGC would have the same problem, but I've not addressed that yet, because Parallel (Full) GC has more severe problems that I want to solve first. G1 concurrent marking should be fine, afaict (it doesn't use or override the from-space length, nor does it need to deal with forwarding pointers).