8320649: C2: Optimize scoped values
This change implements C2 optimizations for calls to ScopedValue.get(). For instance, in:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
v2 can be replaced by v1 and the second call to get() can be
optimized out. That's true regardless of what sits between the two calls, unless a
new mapping for scopedValue is created in between (when that happens,
no optimization is performed for the method being compiled). Hoisting
a get() call out of a loop for a loop-invariant scopedValue should
also be legal in most cases.
ScopedValue.get() is implemented in Java code as a two-step process. A
cache is attached to the current thread object. If the ScopedValue
object is in the cache, then the result of get() is read from
there. Otherwise, a slow call is performed that also inserts the
mapping into the cache. The cache itself is lazily allocated. One
ScopedValue can be hashed to two different indexes in the cache. On a
cache probe, both indexes are checked. As a consequence, probing the
cache is a multi-step process (check if the cache is
present, check the first index, check the second index if the first one
failed). If the cache is populated early on, then when the method that
calls ScopedValue.get() is compiled, the profile reports the slow path
as never taken and only the read from the cache is compiled.
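The two-step lookup described above can be sketched as a stand-alone model in plain Java. This is a hypothetical illustration, not the actual java.base implementation: the class, field, and method names, the hash mixing, and the slow-path body are all invented for the sketch.

```java
// Hypothetical model of the per-thread scoped-value cache probe.
// Entries are stored as (key, value) pairs in a flat Object array.
final class CacheModel {
    static final int SLOTS = 16;                  // number of cache entries
    final Object[] cache = new Object[SLOTS * 2]; // lazily allocated in the real code

    // Each key hashes to two candidate slots; both are checked on a probe.
    static int primaryIndex(int hash)   { return (hash & (SLOTS - 1)) * 2; }
    static int secondaryIndex(int hash) { return ((hash >>> 4) & (SLOTS - 1)) * 2; }

    Object get(Object key, int hash) {
        int i = primaryIndex(hash);
        if (cache[i] == key) return cache[i + 1]; // first index: hit
        int j = secondaryIndex(hash);
        if (cache[j] == key) return cache[j + 1]; // second index: hit
        return slowPath(key, hash);               // miss: slow call, inserts mapping
    }

    Object slowPath(Object key, int hash) {
        Object v = "computed";                    // stand-in for the real slow lookup
        int i = primaryIndex(hash);
        cache[i] = key;                           // populate so later probes hit
        cache[i + 1] = v;
        return v;
    }
}
```

Once the slow path has run once for a given key, later probes hit in the cache, which is the situation where the profile reports the slow path as never taken.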
To perform these optimizations, I added 3 new node types to C2:
- the pair ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for the cache probe
- a CFG node, ScopedValueGetResultNode, to help locate the result of the get() call in the IR graph
In pseudo code, once the nodes are inserted, the code of a get() is:

    hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
    if (hits_in_the_cache) {
        res = ScopedValueGetLoadFromCache(hits_in_the_cache);
    } else {
        res = ..; // slow call, possibly inlined. Subgraph can be arbitrarily complex
    }
    res = ScopedValueGetResult(res)
In the snippet:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
Replacing v2 by v1 is then done by starting from the
ScopedValueGetResult node for the second get() and looking for a
dominating ScopedValueGetResult for the same ScopedValue
object. When one is found, it is used as a replacement. Eliminating
the second get() call is achieved by making
ScopedValueGetHitsInCache always successful if there's a dominating
ScopedValueGetResult, and replacing its companion
ScopedValueGetLoadFromCache by the dominating
ScopedValueGetResult.
Hoisting a get() out of a loop is achieved by peeling one iteration of
the loop. The optimization above then finds a dominating get() and
removes the get() from the loop body.
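The peeling transformation can be illustrated at the source level with plain Java. Here a `Supplier` stands in for the inlined cache probe; all names are invented for the sketch, and the "after" method shows by hand what the compiler achieves on the IR.

```java
import java.util.function.Supplier;

final class PeelSketch {
    // Before: the loop-invariant probe runs on every iteration.
    static int sumBefore(int[] a, Supplier<Integer> get) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * get.get();   // get() is loop invariant
        }
        return sum;
    }

    // After peeling one iteration: the in-loop get() is now dominated by
    // the peeled one, so it is redundant and can be replaced by it,
    // effectively hoisting the probe out of the loop.
    static int sumAfterPeeling(int[] a, Supplier<Integer> get) {
        if (a.length == 0) return 0;
        int v = get.get();             // peeled iteration performs the probe once
        int sum = a[0] * v;
        for (int i = 1; i < a.length; i++) {
            sum += a[i] * v;           // redundant get() eliminated
        }
        return sum;
    }
}
```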
An important case, I think, is when the profile predicts the slow path is
never taken. Then the code of get() is:

    hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
    if (hits_in_the_cache) {
        res = ScopedValueGetLoadFromCache(hits_in_the_cache);
    } else {
        trap();
    }
    res = ScopedValueGetResult(res)
The ScopedValueGetResult doesn't help in that case and is removed early on. The
optimization process then looks for a pair of
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache that
dominates the current pair of
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache and can
replace them. In that case, hoisting a ScopedValue.get() can be
done by predication, and I added special logic to loop predication for that.
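At the source level, hoisting by predication roughly corresponds to checking the loop-invariant cache-hit condition once before the loop and bailing out if it fails. A sketch with invented names, where the uncommon trap of the compiled code is modeled as a slow fallback loop:

```java
final class PredicationSketch {
    // The loop-invariant condition is checked once, before the loop,
    // as a predicate; the fast loop then runs with no per-iteration check.
    static int sum(int[] a, boolean hitsInCache, int cachedValue, int slowValue) {
        if (hitsInCache) {
            int sum = 0;
            for (int v : a) sum += v + cachedValue; // fast loop, probe hoisted
            return sum;
        }
        // In compiled code this branch would be an uncommon trap
        // (deoptimization); here it is modeled as a slow fallback loop.
        int sum = 0;
        for (int v : a) sum += v + slowValue;
        return sum;
    }
}
```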
Adding the new nodes to the graph when a ScopedValue.get() call is
encountered is done in several steps:
1- inlining of ScopedValue.get() is delayed and the call is enqueued
for late inlining.
2- Once the graph is fully constructed, for each call to
ScopedValue.get(), a ScopedValueGetResult is added between the
result of the call and its uses.
3- the call is then inlined by parsing the ScopedValue.get() method
4- finally the subgraph that results is pattern matched and the pieces
required to perform the cache probe are extracted and attached to new
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache nodes
There are a couple of reasons for steps 3 and 4:
- As mentioned above, probing the cache is a multi-step process. Having only 2 nodes in a simple graph shape to represent it makes it easier to write robust optimizations.
- The subgraph for the method after parsing contains valuable pieces of information: profile data that captures which of the 2 locations in the cache is the most likely to cause a hit. Profile data is attached to the nodes.
Removal of redundant nodes is done during loop opts. The ScopedValue
nodes are then expanded. That also happens during loop opts because,
once expansion is over, there are opportunities for further
optimizations/cleanup that can only happen during loop opts. During
expansion, ScopedValueGetResult nodes are removed and
ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache are expanded
to the multi-step process of probing the cache. Profile data attached
to the nodes is used to assign correct frequencies/counts to the If
nodes. Of the 2 locations in the cache that are tested, the one that's
most likely to see a hit (from profile data) is tested first.
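The profile-guided ordering of the two checks can be sketched in Java. The counter fields and method names here are hypothetical; this only illustrates the decision the expansion makes, not the actual C2 code.

```java
final class ProfiledProbe {
    // Profile counts for hits at the two candidate cache slots.
    long hitsAtFirst;
    long hitsAtSecond;

    // Probe the (key, value)-pair cache, emitting the check for the more
    // frequently hitting slot first, as the expansion does with the
    // profile data attached to the ScopedValue nodes.
    Object probe(Object[] cache, Object key, int i1, int i2) {
        int a = (hitsAtFirst >= hitsAtSecond) ? i1 : i2; // likely slot first
        int b = (hitsAtFirst >= hitsAtSecond) ? i2 : i1;
        if (cache[a] == key) return cache[a + 1];
        if (cache[b] == key) return cache[b + 1];
        return null; // miss: slow path in the real code
    }
}
```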
/cc hotspot-compiler
Progress
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
- [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)
Issue
- JDK-8320649: C2: Optimize scoped values (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
$ git checkout pull/16966
Update a local copy of the PR:
$ git checkout pull/16966
$ git pull https://git.openjdk.org/jdk.git pull/16966/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16966
View PR using the GUI difftool:
$ git pr show -t 16966
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16966.diff
Webrev
:wave: Welcome back roland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@rwestrel
The hotspot-compiler label was successfully added.
Webrevs
- 17: Full - Incremental (7723c9c7)
- 16: Full - Incremental (d38872fd)
- 15: Full - Incremental (f63bf543)
- 14: Full (a4ffc11e)
- 13: Full (1f8931d8)
- 12: Full - Incremental (142ca630)
- 11: Full (3f312f8f)
- 10: Full - Incremental (361a6ab7)
- 09: Full (57592601)
- 08: Full - Incremental (2cb19fb5)
- 07: Full (0e2e285b)
- 06: Full (5a31eeab)
- 05: Full - Incremental (a24f729f)
- 04: Full - Incremental (6e797117)
- 03: Full - Incremental (28fa7f74)
- 02: Full (c7d1fe84)
- 01: Full - Incremental (489bb2ab)
- 00: Full (bc028fbd)
No review yet, I just performed some quick testing.
Thanks for doing that. All issues should be fixed now (I couldn't reproduce the last one so not 100% sure about that one).
I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.
- My first reaction was why does this need to be so complicated? Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.
- The reason for breaking up the operations into individual nodes seems to be because of the profiling information. So I'm wondering how much this helps, given the added complexity.
- I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors. The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.
- Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?
- Needing to do things like treat ScopedValueGetHitsInCache as always successful gives me a bad feeling for some reason, and seems unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.
A couple of answers:
I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.
* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors. The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.
Binding and get() are usually separated by a long way. It's a common pattern to use get() inside a loop when a ScopedValue is used to hold a capability object which is private within a library context.
* Are we able to optimize a get() on a constant/final ScopedValue into a simple array load at a constant offset?
Maybe I'm misunderstanding this question, but that's what the scoped value cache does.
I'm not a C2 expert, so my high-level comments might not all make sense, but here goes.
* My first reaction was why does this need to be so complicated?
That's a fair reaction.
Can't we treat the slow path and cache implementation as a black box, and just common up the high-level get()? In my mind the ScopedValue get should be similar to a read from a "condy" dynamic constant.
Initially, I thought about delaying the inlining of get() methods and simply having a pass that looks for get() calls with the same inputs. I don't think that works well because the current late-inlining framework can't delay inlining very late. We don't run loop opts before we're done with inlining, for instance. If we wanted to hoist a call out of a loop, we would need loop opts. For instance, it's likely that a call to get() depends on a null check that we would need to hoist first.
The other thing about optimizing get() calls is that they are heavyweight nodes (a high-level get() macro node would be very similar to a get() call node whichever way you look at it). We don't know how to hoist a call out of a loop. A call acts as a barrier on the entire memory state and gets in the way of memory optimizations. If the profile reports the slow path as never taken, then the shape of the get() becomes lighter weight. It doesn't disrupt other optimizations. Probing the cache acts as a load + test, which we know how to hoist out of a loop.
It felt to me that it would be fairly common for the slow path to not be needed and, given that the shape without the slow path is much easier to optimize, it was important to expose early on whether the slow path was there or not.
* The reason for breaking up the operations into individual nodes seems to be because of the profiling information. So I'm wondering how much this helps, given the added complexity.
The thing about get() is that in simple cases, it optimizes well because of profile data. A get() call, once inlined, can essentially be hoisted out of a loop if all goes well. It doesn't take much for simple optimizations on get() to stop happening. The goal of this patch is to bring consistency and have optimizations work well in all sorts of scenarios. But it would be hard to sell if the simple cases didn't work as well as they do without the patch. And I believe that requires profile data.
* I would expect multiple get() calls in the same method or in loops to be rare and/or programmer errors. The real advantage seems to be when the binding and the get() can be connected from caller to callee through inlining.
Eliminating get() calls with the same inputs may not be common in Java code, but that transformation is a building block for optimizations. Hoisting a get() out of a loop can be achieved by peeling one iteration and letting the get() from the loop body be removed because it's redundant with the one from the peeled iteration. Also, code that C2 optimizes once inlining has happened and dead paths have been trimmed doesn't necessarily look like the Java code the programmer wrote.
* Needing to do things like treat ScopedValueGetHitsInCache as always successful gives me a bad feeling for some reason, and seems unnecessary if we did more at a higher (macro?) level rather than eagerly expanding the high-level operation into individual nodes.
I think my comments above cover that one.
Thanks @rwestrel, that helps. I have no objections to this change, but I don't understand C2 enough to do a deeper review.
Maybe I'm misunderstanding this question, but that's what the scoped value cache does.
@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.
Maybe I'm misunderstanding this question, but that's what the scoped value cache does.
@theRealAph I guess it boils down to whether the hash value can be treated as a compile-time constant, which seems possible because it's marked final.
It always has been in the tests I've done. One of the interesting challenges with this work has been to make sure scoped value performance doesn't regress. A great advantage of this PR is that a dedicated scoped value optimization helps to make such regressions less likely.
Tests all pass now.
Personal wishlist: can you add a case where this optimization enables vectorization? Or do your optimizations happen too late for that?
In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.
In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.
So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation :disappointed:
In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.
So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞
Even if that was possible, ScopedValue get loads are from the cache indexed by a hash stored as a field in the ScopedValue object. I'm not sure how you would be able to tell which of the loads from several get() calls are contiguous in memory.
In its simplest form, a ScopedValue get will have a test (to check it's indeed in the cache) and a load so I don't think vectorization is possible.
So you think this could not be predicated and hoisted out of the loop? That would also be a sad limitation 😞
Even if that was possible, ScopedValue get loads are from the cache indexed by a hash stored as a field in the ScopedValue object. I'm not sure how you would be able to tell which of the loads from several get() calls are contiguous in memory.
A simple example would be to add/multiply it to every element in an array.
Imagine we stream over a container (some array or memory segment), and some "map" method is applied to every element, which adds in the scoped value. If everything is inlined, then this might become a loop where, inside the loop, we add scopedValue.get() to every element, and we would hope the get() floats out of the loop completely, so we can broadcast that value and vectorize the loop.
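The scenario described above, after the get() has been hoisted, might look like this plain Java sketch (a local variable stands in for the hoisted get() result; the class and method names are invented):

```java
final class VectorSketch {
    // Once sv.get() is hoisted out of the loop, the remaining body adds a
    // loop-invariant value to each element, which C2's auto-vectorizer can
    // turn into vector adds of a broadcast value.
    static void addScoped(int[] a, int scoped) {
        for (int i = 0; i < a.length; i++) {
            a[i] += scoped;   // vectorizable: a[i] += broadcast(scoped)
        }
    }
}
```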
Nice work @rwestrel I'm sending out a first batch of comments, more coming later.
Thanks for the careful review.
@rwestrel Should I look at this again, or are you still working on some parts?
@rwestrel Should I look at this again, or are you still working on some parts?
Yes, please. I reworked the change, fixed some bugs and added a vectorization test case.
Ok, this is all for today. I only looked at the first half. Maybe you can react to my comments this far, and then I can continue?
Yes, sounds good. Thanks for the new review.
Maybe a more general question: How bad would it be if we simply did not inline the "get" call, but directly transformed the graph? Is that feasible? Or would it be a performance hit because we could not inspect the profiling inside "get"? It is just a lot of code here, with lots of potential for bugs.
It's probably feasible. I think we want to avoid the slow path (when there's a miss in the cache) as much as possible because having a call or some complicated logic to handle the cache miss will affect all sorts of optimizations. So we would have to speculate and then take a trap if that fails. Given we have profile data in the get(), it seems unfortunate not to use it and have it guide the decisions. The pattern matching is complicated for sure, but it's also fairly mechanical. The Java code being matched is fairly simple. What can affect pattern matching is whether one of the 3 branches can be never taken. There don't seem to be that many ways it could break.
The most fragile part I would say is that the ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode pair have to stay together. If some transformation changes the IR so we can't go from one to the other, then expanding the ScopedValue.get() fails.
I also wonder about test coverage. How many tests do we have? Fuzzer tests currently do not use ScopedValues. So we would never get any really strange patterns.
That's also a weakness for this. Obviously, it's a new feature so there's not that much code using it. There are some functional tests and benchmarks.
@rwestrel I'm getting some test timeouts with -server -Xcomp for compiler/c2/irTests/TestScopedValue.java on our testing infrastructure. It ran for 12 min. Do you get the same?
TEST FAILURE (Commit 9)
java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation fails on linux-aarch64 (so far no other platform):
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000ffff9332a57c, pid=1260224, tid=1260240
#
# JRE version: Java(TM) SE Runtime Environment (23.0) (build 23-internal-2024-02-08-0653493.emanuel.peter.jdk-review)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23-internal-2024-02-08-0653493.emanuel.peter.jdk-review, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V [libjvm.so+0xa8657c] PhaseIdealLoop::get_early_ctrl(Node*)+0x2a8
...
Current CompileTask:
C2:196 61 ! StressStackOverflow$1::run (108 bytes)
Stack: [0x0000ffff71606000,0x0000ffff71804000], sp=0x0000ffff717feba0, free space=2018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xa8657c] PhaseIdealLoop::get_early_ctrl(Node*)+0x2a8 (loopnode.hpp:1139)
V [libjvm.so+0xa86884] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0x84 (loopnode.cpp:251)
V [libjvm.so+0xa868f4] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xf4 (loopnode.cpp:277)
V [libjvm.so+0xa868f4] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xf4 (loopnode.cpp:277)
V [libjvm.so+0xa868f4] PhaseIdealLoop::set_subtree_ctrl(Node*, bool) [clone .part.0]+0xf4 (loopnode.cpp:277)
V [libjvm.so+0xa8e89c] PhaseIdealLoop::test_and_load_from_cache(Node*, Node*, Node*, Node*, float, float, Node*, Node*&, Node*&, Node*&)+0x2fc (loopnode.cpp:5158)
V [libjvm.so+0xa8f168] PhaseIdealLoop::expand_sv_get_hits_in_cache_and_load_from_cache(ScopedValueGetHitsInCacheNode*)+0x558 (loopnode.cpp:5068)
V [libjvm.so+0xa8f528] PhaseIdealLoop::expand_scoped_value_get_nodes()+0x48 (loopnode.cpp:4934)
V [libjvm.so+0xa9ae30] PhaseIdealLoop::build_and_optimize()+0xd70 (loopnode.cpp:4875)
V [libjvm.so+0x59a284] Compile::optimize_loops(PhaseIterGVN&, LoopOptsMode)+0x284 (loopnode.hpp:1113)
V [libjvm.so+0x5a44e8] Compile::Optimize()+0x568 (compile.cpp:2519)
V [libjvm.so+0x5a5834] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0xbf4 (compile.cpp:864)
V [libjvm.so+0x4d516c] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x138 (c2compiler.cpp:142)
V [libjvm.so+0x5ab468] CompileBroker::invoke_compiler_on_method(CompileTask*)+0x998 (compileBroker.cpp:2305)
V [libjvm.so+0x5adf14] CompileBroker::compiler_thread_loop()+0x374 (compileBroker.cpp:1964)
V [libjvm.so+0x821e74] JavaThread::thread_main_inner() [clone .part.0]+0xa0 (javaThread.cpp:721)
V [libjvm.so+0xd176a8] Thread::call_run()+0xa8 (thread.cpp:221)
V [libjvm.so+0xba2338] thread_native_entry(Thread*)+0xd8 (os_linux.cpp:817)
C [libpthread.so.0+0x7928] start_thread+0x188
@rwestrel this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8320649
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
TEST FAILURE (Commit 9)
java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation fails on linux-aarch64 (so far no other platform):
I've not been able to reproduce that one. Is there a replay file?
For java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation, no extra flags on linux-aarch64 (product build):
test_jdk_tier1_part1_java_lang_ScopedValue_StressStackOverflow_no-TieredCompilation_replay_pid1260224.log
And a second failure:
serviceability/jvmti/vthread/SuspendWithInterruptLock/SuspendWithInterruptLock.java#default, with flags:
-Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation on linux-aarch64-debug. No replay since it does not seem to happen during compilation.
# Internal Error (/opt/mach5/mesos/work_dir/slaves/0db9c48f-6638-40d0-9a4b-bd9cc7533eb8-S9925/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/88e53a9e-90ec-4ff7-a48b-f6334fa445de/runs/40b98fc8-a415-4425-860b-fbe1fb83e170/workspace/open/src/hotspot/share/prims/jvmtiEnvBase.cpp:1731), pid=1617205, tid=1617312
# assert(!java_thread->is_in_VTMS_transition()) failed: sanity check
Current thread (0x0000ffff4c08ff10): JavaThread "Thread-1" [_thread_in_vm, id=1617312, stack(0x0000ffff809c5000,0x0000ffff80bc3000) (2040K)] _threads_hazard_ptr=0x0000ffff4c091310, _nested_threads_hazard_ptr_cnt=0
Stack: [0x0000ffff809c5000,0x0000ffff80bc3000], sp=0x0000ffff80bc1020, free space=2032k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x1038de8] JvmtiEnvBase::suspend_thread(oop, JavaThread*, bool, int*)+0x528 (jvmtiEnvBase.cpp:1731)
V [libjvm.so+0x10227f8] JvmtiEnv::SuspendThread(_jobject*)+0xb8 (jvmtiEnv.cpp:948)
V [libjvm.so+0xfe0504] jvmti_SuspendThread+0x194 (jvmtiEnter.cpp:534)
J 2631 jvmti.JVMTIUtils.suspendThread0(Ljava/lang/Thread;)I (0 bytes) @ 0x0000ffff953e4078 [0x0000ffff953e3fc0+0x00000000000000b8]
J 2627 c2 jvmti.JVMTIUtils.suspendThread(Ljava/lang/Thread;)V (22 bytes) @ 0x0000ffff953e8310 [0x0000ffff953e82c0+0x0000000000000050]
j SuspendWithInterruptLock.suspendThread(Ljava/lang/Thread;)V+1
j SuspendWithInterruptLock.suspender(Ljava/lang/Thread;)V+7
j SuspendWithInterruptLock.lambda$main$2(Ljava/lang/Thread;)V+1
J 2582 c2 SuspendWithInterruptLock$$Lambda+0x0000000201003458.run()V (8 bytes) @ 0x0000ffff953d2dac [0x0000ffff953d2d40+0x000000000000006c]
J 1763 c2 java.lang.Thread.run()V java.base@23-internal (23 bytes) @ 0x0000ffff951461e8 [0x0000ffff95146140+0x00000000000000a8]
v ~StubRoutines::call_stub 0x0000ffff94c1c190
V [libjvm.so+0xd49374] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x464 (javaCalls.cpp:415)
V [libjvm.so+0xd4993c] JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x29c (javaCalls.cpp:329)
V [libjvm.so+0xd49b3c] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x6c (javaCalls.cpp:191)
V [libjvm.so+0xea4990] thread_entry(JavaThread*, JavaThread*)+0xa0 (jvm.cpp:2937)
V [libjvm.so+0xd7d71c] JavaThread::thread_main_inner()+0xcc (javaThread.cpp:721)
V [libjvm.so+0x15beec0] Thread::call_run()+0xac (thread.cpp:221)
V [libjvm.so+0x1333bcc] thread_native_entry(Thread*)+0x12c (os_linux.cpp:817)
C [libpthread.so.0+0x7928] start_thread+0x188
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 2631 jvmti.JVMTIUtils.suspendThread0(Ljava/lang/Thread;)I (0 bytes) @ 0x0000ffff953e4078 [0x0000ffff953e3fc0+0x00000000000000b8]
J 2627 c2 jvmti.JVMTIUtils.suspendThread(Ljava/lang/Thread;)V (22 bytes) @ 0x0000ffff953e8310 [0x0000ffff953e82c0+0x0000000000000050]
j SuspendWithInterruptLock.suspendThread(Ljava/lang/Thread;)V+1
j SuspendWithInterruptLock.suspender(Ljava/lang/Thread;)V+7
j SuspendWithInterruptLock.lambda$main$2(Ljava/lang/Thread;)V+1
J 2582 c2 SuspendWithInterruptLock$$Lambda+0x0000000201003458.run()V (8 bytes) @ 0x0000ffff953d2dac [0x0000ffff953d2d40+0x000000000000006c]
J 1763 c2 java.lang.Thread.run()V java.base@23-internal (23 bytes) @ 0x0000ffff951461e8 [0x0000ffff95146140+0x00000000000000a8]
v ~StubRoutines::call_stub 0x0000ffff94c1c190
For
java/lang/ScopedValue/StressStackOverflow.java#no-TieredCompilation, no extra flags on linux-aarch64 (product build): test_jdk_tier1_part1_java_lang_ScopedValue_StressStackOverflow_no-TieredCompilation_replay_pid1260224.log
I fixed that one and added a test case for it.
And a second failure:
serviceability/jvmti/vthread/SuspendWithInterruptLock/SuspendWithInterruptLock.java#default, with flags: -Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation on linux-aarch64-debug. No replay since it does not seem to happen during compilation.
This one doesn't seem related. I couldn't reproduce it. Can you reproduce it? Could it be a spurious failure that just happened to show up on your test run?
Ok, this is all for today. I only looked at the first half. Maybe you can react to my comments this far, and then I can continue?
I pushed a new commit that addresses your comments.
@rwestrel great that you fixed the first one! The second one is indeed unrelated: https://bugs.openjdk.org/browse/JDK-8312064
Thanks for all the updates. I might not get to it today, but hopefully next week.
@rwestrel Would you mind resolving the merge conflicts? I don't want to be reviewing and then have all the comments disappear suddenly :sweat_smile:
Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;)
Ah, one more thing: what about a JMH benchmark where you can show off how much this optimization improves runtime? ;)
We already have benchmarks, but the biggest win due to this change is the opportunity to reduce the load on the scoped value cache.
At present, high performance depends on the per-thread cache, which is a 16-element OOP array. This is a fairly heavyweight structure for virtual threads, which otherwise have a very small heap footprint. With this optimization I think I can shrink the cache without significant loss of performance in most cases. I might also be able to move this cache to the carrier thread.
So, this patch significantly moves the balance point in the space/speed tradeoff.
@theRealAph that is exciting! It's a bit scary to have over 2000 lines for this optimization, makes it quite hard to review. But let's keep working on it :blush: