
8320649: C2: Optimize scoped values

Open rwestrel opened this issue 2 years ago • 60 comments

This change implements C2 optimizations for calls to ScopedValue.get(). Indeed, in:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

v2 can be replaced by v1 and the second call to get() can be optimized out. That's true whatever code sits between the 2 calls, unless a new mapping for scopedValue is created in between (when that happens, no optimization is performed for the method being compiled). Hoisting a get() call out of a loop for a loop-invariant scopedValue should also be legal in most cases.

ScopedValue.get() is implemented in Java code as a 2 step process. A cache is attached to the current thread object. If the ScopedValue object is in the cache, then the result of get() is read from there. Otherwise a slow call is performed that also inserts the mapping into the cache. The cache itself is lazily allocated. One ScopedValue can hash to 2 different indexes in the cache. On a cache probe, both indexes are checked. As a consequence, probing the cache is a multi step process (check if the cache is present, check the first index, check the second index if the first one missed). If the cache is populated early on, then when the method that calls ScopedValue.get() is compiled, profiling reports the slow path as never taken and only the read from the cache is compiled.
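As a runnable illustration of that probe sequence, here is a minimal model (the class and method names are made up for this sketch and are not the JDK's implementation; unlike the real cache, this one is allocated eagerly):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the per-thread scoped value cache: key/value pairs
// stored in a flat array, each key hashing to two candidate slots.
final class ScopedValueCacheModel {
    private static final int SLOTS = 16;                    // power of two
    private final Object[] cache = new Object[SLOTS * 2];   // [key, value] pairs
    private final Map<Object, Object> bindings = new HashMap<>(); // stands in for the slow path

    void bind(Object key, Object value) { bindings.put(key, value); }

    // One key hashes to two candidate indexes.
    private int firstIndex(Object key)  { return (key.hashCode() & (SLOTS - 1)) * 2; }
    private int secondIndex(Object key) { return ((key.hashCode() >>> 4) & (SLOTS - 1)) * 2; }

    Object get(Object key) {
        int i1 = firstIndex(key);
        if (cache[i1] == key) return cache[i1 + 1];         // hit at first index
        int i2 = secondIndex(key);
        if (cache[i2] == key) return cache[i2 + 1];         // hit at second index
        Object v = bindings.get(key);                       // slow path...
        cache[i1] = key;                                    // ...which also inserts
        cache[i1 + 1] = v;                                  // the mapping in the cache
        return v;
    }
}
```

The second call to get() for the same key takes the first-index hit path, which is the shape profiling typically sees once the cache is warm.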

To perform the optimizations, I added 3 new node types to C2:

  • the pair ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for the cache probe

  • a cfg node ScopedValueGetResultNode to help locate the result of the get() call in the IR graph.

In pseudo code, once the nodes are inserted, the code of a get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  res = ..; // slow call, possibly inlined. Subgraph can be arbitrarily complex
}
res = ScopedValueGetResult(res)

In the snippet:

v1 = scopedValue.get();
...
v2 = scopedValue.get();

Replacing v2 by v1 is then done by starting from the ScopedValueGetResult node for the second get() and looking for a dominating ScopedValueGetResult for the same ScopedValue object. When one is found, it is used as a replacement. Eliminating the second get() call is achieved by making ScopedValueGetHitsInCache always successful if there's a dominating ScopedValueGetResult and replacing its companion ScopedValueGetLoadFromCache by the dominating ScopedValueGetResult.

Hoisting a get() out of a loop is achieved by peeling one iteration of the loop. The optimization above then finds a dominating get() and removes the get() from the loop body.
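At the source level, the effect of peeling can be sketched as follows (runnable; `fakeGet()` is a hypothetical instrumented stand-in for an inlined ScopedValue.get() so the call count is observable):

```java
class PeelingDemo {
    static int getCalls = 0;
    static int fakeGet() { getCalls++; return 7; }   // stands in for sv.get()

    // Original shape: get() executes on every iteration.
    static int before(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * fakeGet();
        return sum;
    }

    // Shape after peeling one iteration: the get() in the remaining loop
    // body is dominated by the peeled one, commons up with it, and
    // disappears from the loop.
    static int after(int[] a) {
        int sum = 0;
        if (a.length > 0) {
            int v = fakeGet();           // peeled iteration's get()
            sum += a[0] * v;
            for (int i = 1; i < a.length; i++) sum += a[i] * v;
        }
        return sum;
    }
}
```

Both shapes compute the same result, but the peeled one performs a single get() regardless of trip count.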

An important case, I think, is when profiling predicts the slow path never to be taken. Then the code of get() is:

hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
  res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
  trap();
}
res = ScopedValueGetResult(res)

The ScopedValueGetResult doesn't help and is removed early on. The optimization process then looks for a pair of ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache that dominates the current pair of ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache and can replace them. In that case, hoisting a ScopedValue.get() can be done by predication, and I added special logic in predication for that.

Adding the new nodes to the graph when a ScopedValue.get() call is encountered is done in several steps:

1- inlining of ScopedValue.get() is delayed and the call is enqueued for late inlining.

2- Once the graph is fully constructed, for each call to ScopedValue.get(), a ScopedValueGetResult is added between the result of the call and its uses.

3- the call is then inlined by parsing the ScopedValue.get() method

4- finally the subgraph that results is pattern matched and the pieces required to perform the cache probe are extracted and attached to new ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache nodes

There are a couple of reasons for steps 3 and 4:

  • As mentioned above probing the cache is a multi step process. Having only 2 nodes in a simple graph shape to represent it makes it easier to write robust optimizations

  • the subgraph for the method after parsing contains valuable pieces of information: profile data that captures which of the 2 locations in the cache is the most likely to cause a hit. Profile data is attached to the nodes.

Removal of redundant nodes is done during loop opts. The ScopedValue nodes are then expanded. That also happens during loop opts because once expansion is over, there are opportunities for further optimizations/clean up that can only happen during loop opts. During expansion, ScopedValueGetResult nodes are removed and ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache are expanded to the multi step process of probing the cache. Profile data attached to the nodes is used to assign correct frequencies/counts to the If nodes. Of the 2 locations in the cache that are tested, the one that's the most likely to see a hit (from profile data) is checked first.
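In source form, the expanded probe is roughly the following (a sketch with hypothetical names; the real expansion is emitted as C2 If/Load nodes with frequencies taken from the attached profile data):

```java
import java.util.function.Supplier;

class ProbeExpansionDemo {
    // Expanded form of a ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache
    // pair: check that the cache exists, then the slot profiling says hits
    // most often, then the other slot, then fall back to the slow path.
    static Object probe(Object[] cache, Object sv, int likelyIdx, int unlikelyIdx,
                        Supplier<Object> slowPath) {
        if (cache != null) {                                // cache lazily allocated
            if (cache[likelyIdx] == sv)   return cache[likelyIdx + 1];
            if (cache[unlikelyIdx] == sv) return cache[unlikelyIdx + 1];
        }
        return slowPath.get();
    }
}
```

Which of the two indexes is `likelyIdx` is exactly what the profile data attached to the nodes decides.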

/cc hotspot-compiler


Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8320649: C2: Optimize scoped values (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
$ git checkout pull/16966

Update a local copy of the PR:
$ git checkout pull/16966
$ git pull https://git.openjdk.org/jdk.git pull/16966/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16966

View PR using the GUI difftool:
$ git pr show -t 16966

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16966.diff

Webrev

Link to Webrev Comment

rwestrel avatar Dec 05 '23 08:12 rwestrel

:wave: Welcome back roland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Dec 05 '23 08:12 bridgekeeper[bot]

@rwestrel The hotspot-compiler label was successfully added.

openjdk[bot] avatar Dec 05 '23 08:12 openjdk[bot]

No review yet, I just performed some quick testing.

Thanks for doing that. All issues should be fixed now (I couldn't reproduce the last one so not 100% sure about that one).

rwestrel avatar Dec 07 '23 14:12 rwestrel

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

  • My first reaction was why does this need to be so complicated? Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.
  • The reason for breaking up the operations into individual nodes seems to be because of the profiling information. So I'm wondering how much this helps, given the added complexity.
  • I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors. The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.
  • Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?
  • Needing to do things like treat ScopedValueGetHitsInCache as always successful gives me a bad feeling for some reason, and seems unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.

dean-long avatar Jan 05 '24 00:01 dean-long

A couple of answers:

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors.  The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.

Binding and get() are usually separated by a long way. It's a common pattern to use get() inside a loop when a ScopedValue is used to hold a capability object which is private within a library context.
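A sketch of that pattern (here `ScopedBinding` is a hand-rolled stand-in for ScopedValue so the snippet is self-contained; the real API is dynamically scoped and per thread):

```java
import java.util.function.Supplier;

// Minimal dynamically scoped binding, standing in for ScopedValue<T>.
final class ScopedBinding<T> {
    private T bound;                       // per-thread in the real API
    <R> R callWhere(T value, Supplier<R> body) {
        T prev = bound;
        bound = value;
        try { return body.get(); } finally { bound = prev; }
    }
    T get() { return bound; }
}

class CapabilityPatternDemo {
    // A library-private "capability" published through the binding.
    private static final ScopedBinding<String> CAPABILITY = new ScopedBinding<>();

    static int useInLoop(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            // get() inside the loop body: the optimization aims to prove
            // this read loop-invariant and hoist it.
            sum += CAPABILITY.get().length();
        }
        return sum;
    }

    static int run() {
        return CAPABILITY.callWhere("secret", () -> useInLoop(3));
    }
}
```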

* Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

theRealAph avatar Jan 05 '24 10:01 theRealAph

I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.

* My first reaction was why does this need to be so complicated?

That's a fair reaction.

Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.

Initially, I thought about delaying the inlining of get() methods and simply having a pass that looks for get() calls with the same inputs. I don't think that works well because the current late inlining framework can't delay inlining very late. We don't run loop opts before we're done with inlining, for instance. If we wanted to hoist a call out of a loop we would need loop opts: it's likely a call to get() depends on a null check that we would need to hoist first.

The other thing about optimizing get() calls is that they are heavyweight nodes (a high level get() macro node would be very similar to a get() call node whichever way you look at it). We don't know how to hoist a call out of a loop. A call acts as a barrier on the entire memory state and gets in the way of memory optimizations. If profiling reports the slow path as never taken, then the shape of the get() becomes lighter weight. It doesn't disrupt other optimizations. Probing the cache acts as a load + test, which we know how to hoist from a loop.

It felt to me that it would be fairly common for the slow path to not be needed, and given that the shape without the slow path is much easier to optimize, it was important to expose early on whether the slow path was there or not.

* The reason for breaking up the operations into individual nodes seems to be because of the profiling information.  So I'm wondering how much this helps, given the added complexity.

The thing about get() is that in simple cases, it optimizes well because of profile data. A get() call once inlined can essentially be hoisted out of a loop if all goes well. It doesn't take much for simple optimizations on get() to not happen anymore. The goal of this patch is to bring consistency and have optimizations work well in all sorts of scenarios. But it would be hard to sell if the simple cases don't work as well as they do without the patch. And I believe that requires profile data.

* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors.  The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.

Eliminating get() calls with the same inputs may not be common in java code but that transformation is a building block for optimizations. Hoisting a get() out of a loop can be achieved by peeling one iteration and letting the get() from the loop body be removed because it's redundant with the one from the peeled iteration. Also, code that c2 optimizes once inlining has happened and dead paths have been trimmed doesn't necessarily look like the java code the programmer wrote.

* Needing to do things like treat ScopedValueGetHitsInCache as always successful gives me a bad feeling for some reason, and seems unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.

I think my comments above cover that one.

rwestrel avatar Jan 05 '24 12:01 rwestrel

Thanks @rwestrel, that helps. I have no objections to this change, but I don't understand C2 enough to do a deeper review.

dean-long avatar Jan 05 '24 23:01 dean-long

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.

dean-long avatar Jan 05 '24 23:01 dean-long

Maybe I'm misunderstanding this question, but that's what the scoped value cache does.

@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.

It always has been in the tests I've done. One of the interesting challenges with this work has been to make sure scoped value performance doesn't regress. A great advantage of this PR is that a dedicated scoped value optimization helps to make such regressions less likely.

theRealAph avatar Jan 06 '24 10:01 theRealAph

Tests all pass now.

TobiHartmann avatar Jan 08 '24 10:01 TobiHartmann

Personal wishlist: can you add a case where this optimization enables vectorization? Or do your optimizations happen too late for that?

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

rwestrel avatar Jan 25 '24 14:01 rwestrel

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation :disappointed:

eme64 avatar Jan 25 '24 14:01 eme64

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞

Even if that was possible, ScopedValue get loads are from the cache indexed by a hash stored as a field in the ScopedValue object. I'm not sure how you would be able to tell which of the loads from several get() calls are contiguous in memory.

rwestrel avatar Jan 25 '24 14:01 rwestrel

In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.

So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞

Even if that was possible, ScopedValue get loads are from the cache indexed by a hash stored as a field in the ScopedValue object. I'm not sure how you would be able to tell which of the loads from several get() calls are contiguous in memory.

A simple example would be to add/multiply it to every element in an array.

Imagine we stream over a container (some array or memory segment). And some "map" method is applied to every element, which adds in the scoped value. If everything is inlined, then this might come to be a loop, where inside the loop we add every element to the scopedValue.get(), and we would hope the get floats out of the loop completely, so we can broadcast that value, and vectorize the loop.
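Sketched in source form (with a hypothetical `IntSupplier` standing in for the scoped value read): once `v` is hoisted, the loop body is a plain element-wise add that the auto-vectorizer can handle by broadcasting `v` into a vector register.

```java
import java.util.function.IntSupplier;

class MapWithScopedValue {
    // After the optimization, the get() is loop-invariant: read once before
    // the loop, so the loop itself reduces to a vectorizable element-wise add.
    static int[] addScoped(int[] a, IntSupplier scopedGet) {
        int v = scopedGet.getAsInt();   // hoisted out of the loop
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] + v;            // broadcast v, then vector add
        }
        return r;
    }
}
```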

eme64 avatar Jan 25 '24 15:01 eme64

Nice work @rwestrel I'm sending out a first batch or comments, more coming later.

Thanks for the careful review.

rwestrel avatar Jan 25 '24 16:01 rwestrel

@rwestrel Should I look at this again, or are you still working on some parts?

eme64 avatar Feb 06 '24 17:02 eme64

@rwestrel Should I look at this again, or are you still working on some parts?

Yes, please. I reworked the change, fixed some bugs and added a vectorization test case.

rwestrel avatar Feb 07 '24 08:02 rwestrel

Ok, this is all for today. I only looked at the first half. Maybe you can react to my comments this far, and then I can continue?

Yes, sounds good. Thanks for the new review.

Maybe a more general question: How bad would it be if we simply did not inline the "get" call, but directly transformed the graph? Is that feasible? Or would it be a performance hit because we could not inspect the profiling inside "get"? It is just a lot of code here, with lots of potential for bugs.

It's probably feasible. I think we want to avoid the slow path (when there's a miss in the cache) as much as possible, because having a call or some complicated logic to handle the cache miss will affect all sorts of optimizations. So we would have to speculate and then take a trap if that fails. Given we have profile data in the get(), it seems unfortunate not to use it and have it guide the decisions. The pattern matching is complicated, for sure, but it's also fairly mechanical. The Java code being matched is fairly simple. What can affect pattern matching is whether one of the 3 branches is never taken. There don't seem to be that many ways it could break.

The most fragile part I would say is that the ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode pair have to stay together. If some transformation changes the IR so we can't go from one to the other, then expanding the ScopedValue.get() fails.

I also wonder about test coverage. How many tests do we have? Fuzzer tests currently do not use ScopedValues. So we would never get any really strange patterns.

That's also a weakness for this. Obviously, it's a new feature so there's not that much code using it. There are some functional tests and benchmarks.

rwestrel avatar Feb 08 '24 15:02 rwestrel

@rwestrel I'm getting some test timeouts with -server -Xcomp for compiler/c2/irTests/TestScopedValue.java, on our testing infrastructure. It ran for 12 min. Do you get the same?

TEST FAILURE (Commit 9) java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation fails on linux-aarch64 (so far no other platform):

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff9332a57c, pid=1260224, tid=1260240
#
# JRE version: Java(TM) SE Runtime Environment (23.0) (build 23-internal-2024-02-08-0653493.emanuel.peter.jdk-review)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23-internal-2024-02-08-0653493.emanuel.peter.jdk-review, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0xa8657c]  PhaseIdealLoop::get_early_ctrl(Node*)+0x2a8

...

Current CompileTask:
C2:196   61   !         StressStackOverflow$1::run (108 bytes)

Stack: [0x0000ffff71606000,0x0000ffff71804000],  sp=0x0000ffff717feba0,  free space=2018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa8657c]  PhaseIdealLoop::get_early_ctrl(Node*)+0x2a8  (loopnode.hpp:1139)
V  [libjvm.so+0xa86884]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0x84  (loopnode.cpp:251)
V  [libjvm.so+0xa868f4]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xf4  (loopnode.cpp:277)
V  [libjvm.so+0xa868f4]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xf4  (loopnode.cpp:277)
V  [libjvm.so+0xa868f4]  PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xf4  (loopnode.cpp:277)
V  [libjvm.so+0xa8e89c]  PhaseIdealLoop::test_and_load_from_cache(Node*, Node*, Node*, Node*, float, float, Node*, Node*&, Node*&, Node*&)+0x2fc  (loopnode.cpp:5158)
V  [libjvm.so+0xa8f168]  PhaseIdealLoop::expand_sv_get_hits_in_cache_and_load_from_cache(ScopedValueGetHitsInCacheNode*)+0x558  (loopnode.cpp:5068)
V  [libjvm.so+0xa8f528]  PhaseIdealLoop::expand_scoped_value_get_nodes()+0x48  (loopnode.cpp:4934)
V  [libjvm.so+0xa9ae30]  PhaseIdealLoop::build_and_optimize()+0xd70  (loopnode.cpp:4875)
V  [libjvm.so+0x59a284]  Compile::optimize_loops(PhaseIterGVN&, LoopOptsMode)+0x284  (loopnode.hpp:1113)
V  [libjvm.so+0x5a44e8]  Compile::Optimize()+0x568  (compile.cpp:2519)
V  [libjvm.so+0x5a5834]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0xbf4  (compile.cpp:864)
V  [libjvm.so+0x4d516c]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x138  (c2compiler.cpp:142)
V  [libjvm.so+0x5ab468]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0x998  (compileBroker.cpp:2305)
V  [libjvm.so+0x5adf14]  CompileBroker::compiler_thread_loop()+0x374  (compileBroker.cpp:1964)
V  [libjvm.so+0x821e74]  JavaThread::thread_main_inner() [clone .part.0]+0xa0  (javaThread.cpp:721)
V  [libjvm.so+0xd176a8]  Thread::call_run()+0xa8  (thread.cpp:221)
V  [libjvm.so+0xba2338]  thread_native_entry(Thread*)+0xd8  (os_linux.cpp:817)
C  [libpthread.so.0+0x7928]  start_thread+0x188

eme64 avatar Feb 10 '24 21:02 eme64

@rwestrel this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8320649
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Feb 13 '24 10:02 openjdk[bot]

TEST FAILURE (Commit 9) java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation fails on linux-aarch64 (so far no other platform):

I've not been able to reproduce that one. Is there a replay file?

rwestrel avatar Feb 13 '24 15:02 rwestrel

For java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation, no extra flags on linux-aarch64 (product build): test_jdk_tier1_part1_java_lang_ScopedValue_StressStackOverflow_no-TieredCompilation_replay_pid1260224.log

And a second failure: serviceability/jvmti/vthread/SuspendWithInterruptLock/SuspendWithInterruptLock.java#default, with flags: -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation on linux-aarch64-debug. No replay since it does not seem to happen during compilation.

#  Internal Error (/opt/mach5/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S9925/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/88e53a9e-90ec-4ff7-a48b-f6334fa445de/runs/40b98fc8-a415-4425-860b-fbe1fb83e170/workspace/open/src/hotspot/share/prims/jvmtiEnvBase.cpp:1731), pid=1617205, tid=1617312
#  assert(!java_thread->is_in_VTMS_transition()) failed: sanity check

Current thread (0x0000ffff4c08ff10):  JavaThread "Thread-1"         [_thread_in_vm, id=1617312, stack(0x0000ffff809c5000,0x0000ffff80bc3000) (2040K)] _threads_hazard_ptr=0x0000ffff4c091310, _nested_threads_hazard_ptr_cnt=0

Stack: [0x0000ffff809c5000,0x0000ffff80bc3000],  sp=0x0000ffff80bc1020,  free space=2032k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1038de8]  JvmtiEnvBase::suspend_thread(oop, JavaThread*, bool, int*)+0x528  (jvmtiEnvBase.cpp:1731)
V  [libjvm.so+0x10227f8]  JvmtiEnv::SuspendThread(_jobject*)+0xb8  (jvmtiEnv.cpp:948)
V  [libjvm.so+0xfe0504]  jvmti_SuspendThread+0x194  (jvmtiEnter.cpp:534)
J 2631  jvmti.JVMTIUtils.suspendThread0(Ljava/lang/Thread;)I (0 bytes) @ 0x0000ffff953e4078 [0x0000ffff953e3fc0+0x00000000000000b8]
J 2627 c2 jvmti.JVMTIUtils.suspendThread(Ljava/lang/Thread;)V (22 bytes) @ 0x0000ffff953e8310 [0x0000ffff953e82c0+0x0000000000000050]
j  SuspendWithInterruptLock.suspendThread(Ljava/lang/Thread;)V+1
j  SuspendWithInterruptLock.suspender(Ljava/lang/Thread;)V+7
j  SuspendWithInterruptLock.lambda$main$2(Ljava/lang/Thread;)V+1
J 2582 c2 SuspendWithInterruptLock$$Lambda+0x0000000201003458.run()V (8 bytes) @ 0x0000ffff953d2dac [0x0000ffff953d2d40+0x000000000000006c]
J 1763 c2 java.lang.Thread.run()V java.base@23-internal (23 bytes) @ 0x0000ffff951461e8 [0x0000ffff95146140+0x00000000000000a8]
v  ~StubRoutines::call_stub 0x0000ffff94c1c190
V  [libjvm.so+0xd49374]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x464  (javaCalls.cpp:415)
V  [libjvm.so+0xd4993c]  JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x29c  (javaCalls.cpp:329)
V  [libjvm.so+0xd49b3c]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x6c  (javaCalls.cpp:191)
V  [libjvm.so+0xea4990]  thread_entry(JavaThread*, JavaThread*)+0xa0  (jvm.cpp:2937)
V  [libjvm.so+0xd7d71c]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:721)
V  [libjvm.so+0x15beec0]  Thread::call_run()+0xac  (thread.cpp:221)
V  [libjvm.so+0x1333bcc]  thread_native_entry(Thread*)+0x12c  (os_linux.cpp:817)
C  [libpthread.so.0+0x7928]  start_thread+0x188
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 2631  jvmti.JVMTIUtils.suspendThread0(Ljava/lang/Thread;)I (0 bytes) @ 0x0000ffff953e4078 [0x0000ffff953e3fc0+0x00000000000000b8]
J 2627 c2 jvmti.JVMTIUtils.suspendThread(Ljava/lang/Thread;)V (22 bytes) @ 0x0000ffff953e8310 [0x0000ffff953e82c0+0x0000000000000050]
j  SuspendWithInterruptLock.suspendThread(Ljava/lang/Thread;)V+1
j  SuspendWithInterruptLock.suspender(Ljava/lang/Thread;)V+7
j  SuspendWithInterruptLock.lambda$main$2(Ljava/lang/Thread;)V+1
J 2582 c2 SuspendWithInterruptLock$$Lambda+0x0000000201003458.run()V (8 bytes) @ 0x0000ffff953d2dac [0x0000ffff953d2d40+0x000000000000006c]
J 1763 c2 java.lang.Thread.run()V java.base@23-internal (23 bytes) @ 0x0000ffff951461e8 [0x0000ffff95146140+0x00000000000000a8]
v  ~StubRoutines::call_stub 0x0000ffff94c1c190

eme64 avatar Feb 13 '24 17:02 eme64

For java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation, no extra flags on linux-aarch64 (product build): test_jdk_tier1_part1_java_lang_ScopedValue_StressStackOverflow_no-TieredCompilation_replay_pid1260224.log

I fixed that one and added a test case for it.

And a second failure: serviceability/jvmti/vthread/SuspendWithInterruptLock/SuspendWithInterruptLock.java#default, with flags: -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation on linux-aarch64-debug. No replay since it does not seem to happen during compilation.

This one doesn't seem related. I couldn't reproduce it. Can you reproduce it? Could it be a spurious failure that just happened to show up on your test run?

rwestrel avatar Feb 15 '24 16:02 rwestrel

Ok, this is all for today. I only looked at the first half. Maybe you can react to my comments this far, and then I can continue?

I pushed a new commit that addresses your comments.

rwestrel avatar Feb 15 '24 16:02 rwestrel

@rwestrel great that you fixed the first one! The second one is indeed unrelated: https://bugs.openjdk.org/browse/JDK-8312064

Thanks for all the updates. I might not get to it today, but hopefully next week.

eme64 avatar Feb 16 '24 08:02 eme64

@rwestrel Would you mind resolving the merge conflicts? I don't want to be reviewing and then have all the comments disappear suddenly :sweat_smile:

eme64 avatar Feb 26 '24 16:02 eme64

Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;)

eme64 avatar Feb 26 '24 16:02 eme64

Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;)

We already have benchmarks, but the biggest win due to this change is the opportunity to reduce the load on the scoped value cache.

At present, high performance depends on the per-thread cache, which is a 16-element OOP array. This is a fairly heavyweight structure for virtual threads, which otherwise have a very small heap footprint. With this optimization I think I can shrink the cache without significant loss of performance in most cases. I might also be able to move this cache to the carrier thread.

So, this patch significantly moves the balance point in the space/speed tradeoff.

theRealAph avatar Feb 29 '24 16:02 theRealAph

@theRealAph that is exciting! It's a bit scary to have over 2000 lines for this optimization, makes it quite hard to review. But let's keep working on it :blush:

eme64 avatar Mar 04 '24 09:03 eme64