8328866: Add raw monitor rank support to the debug agent.
This PR adds ranked monitor support to the debug agent. The debug agent has a large number of monitors, and it's really hard to know which order to grab them in, and for that matter which monitors might already be held at any given moment. By imposing a rank on each monitor, we can check to make sure they are always grabbed in the order of their rank. Having this in place when I was working on JDK-8324868 would have made it much easier to detect a deadlock that was occuring, and the reason for it. That's what motivated me to do this work
There were 2 or 3 minor rank issues discovered as a result of these changes. I also learned a lot about some of the more ugly details of the locking support in the process.
Tested with the following on all supported platforms and with virtual threads:
com/sun/jdi vmTestbase/nsk/jdi vmTestbase/nsk/jdb vmTestbase/nsk/jdwp
Still need to run tier2 and tier5.
Details of the changes follow in the first comment.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Warning
⚠️ Found trailing period in issue title for 8328866: Add raw monitor rank support to the debug agent.
Issue
- JDK-8328866: Add raw monitor rank support to the debug agent. (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19044/head:pull/19044
$ git checkout pull/19044
Update a local copy of the PR:
$ git checkout pull/19044
$ git pull https://git.openjdk.org/jdk.git pull/19044/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 19044
View PR using the GUI difftool:
$ git pr show -t 19044
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19044.diff
Webrev
:wave: Welcome back cjplummer! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@plummercj The following label will be automatically applied to this pull request:
-
serviceability
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
I was able to minimize changes for users of RawMonitor by keeping the same client wrapper APIs (eg. debugMonitorEnter), and for the most part passing in the same argument (the monitor to enter). The only change is that the client is passing a new type called DebugRawMonitor* instead of a jrawMonitorID. The DebugRawMonitor encapsulates the jrawMonitorID and contains other monitor info that needs to be tracked (like the rank and current owner).
Ranks were determined largely by trial and error by running tests and fixing the rank violations asserts as I hit them. In the enum that contains all the rankings, I explain the rational for each monitor's rank.
For the moment I have included a flag (rankedmonitors) to control whether or not ranked monitors should be used. This is just a safety net in case someone runs into an issue. The flag is not documented and will likely be removed at some point.
The ranked monitor supported ended up needing a RawMonitor itself (dbgRawMonitor_lock). This RawMonitor is therefore not part of the ranked monitor support. It could probably use a better name. Suggestions are welcomed.
There are a few issues I fixed while working on the ranked monitor support:
During debug agent initialization we need to call util_initialize() before commonRef_initialize() because the later creates a monitor, and that depends on util_initialize() having already been called.
getLocks() in threadControl.c wasn't grabbing locks in the right order. This could have lead to a deadlock, but seems the structure of how locks are used prevented one from happening.
In invoker_completeInvokeRequest() I had to add grabbing the stepLock in order to maintain proper lock order when getLocks() was eventually called (which also grabbed stepLock). Other than triggering a rank violation assert, this doesn't seem to have caused any real issue, but did leave the potential for a deadlock.
In handleInterrupt() I noticed that a local JNI ref was not being freed. I noticed because there were other places where I had added calls to threadControl_currentThread() that resulted in a noticeable JNI ref leak. When I fixed those, I fixed the one in handleInterrupt() too even tough it was not causing any problems.
The ranked monitor APIs needed some extra safeguards during VM exit that are not needed for the unranked monitor APIs. This is because the unranked APIs didn't have a problem with the current thread being NULL, but the ranked APIs require a valid current thread. That is why you see the following check:
if (gdata->vmDead && current_thread == NULL) {
return;
}
Webrevs
- 11: Full - Incremental (ced4e277)
- 10: Full (0cb80d4f)
- 09: Full - Incremental (185233aa)
- 08: Full (dc978d7a)
- 07: Full - Incremental (33fcea3e)
- 06: Full - Incremental (c3bd1716)
- 05: Full - Incremental (1c6a2e34)
- 04: Full - Incremental (301da7d5)
- 03: Full - Incremental (049d3884)
- 02: Full - Incremental (c782279b)
- 01: Full - Incremental (ece5f34f)
- 00: Full (bcb92b59)
Chris, I'm looking at this fix.
I added a verifyMonitorRank() when doing a debugMonitorWait() since wait does an exit and enter, so we should do the same rank check that we do during an enter. I was a bit worried that we might be doing wait() while holding higher ranking lock, especially given how getLocks() pro-actively grabs locks before they are needed, but all tests (surprisingly) passed after this change.
What do you think about the terminology usage of "rank". In this PR the term "higher ranked" means that it must be entered first (before any monitor with a lower rank), but 0 is the highest rank, and bigger numbers mean a lower rank. Sometimes when we rank things (in general) we say #1 is the best, but sometimes we instead rank with a score, and the highest score is the best. I'm wondering if for this PR the rank ordering should be reversed, so a "higher rank" is actually a higher number.
Update: Hotspot seems to use the term "higher/lowest rank" in the same way, but applies the lowest rank number to the lowest ranked mutex, so this would suggest that I should reverse the rank orders.
What do you think about the terminology usage of "rank".
I slightly prefer to have the same rank order as Hotspot uses. At the same time, I'm not sure if it won't cause any issues/difficulties. But I'd give it a try.
Note we can't check if the current thread owns a lock without grabbing dbgRawMonitor. In fact the main purpose of dbgRawMonitor is to allow the current thread to check if it owns the monitor.
That smells fishy to me. A thread can always safely check if it owns something because it can't race with itself and get a wrong answer.
That smells fishy to me. A thread can always safely check if it owns something because it can't race with itself and get a wrong answer.
Unfortunately checking ownership means comparing jobjects, which you can't safely do if you are comparing to a jobject that could be freed at any moment (I'm referring the JNI ref being freed, not the object being GC'd). I ran into asserts in JNI IsSameObject() due to this happening. Basically the local ref became invalid between the time it was fetched from the DebugRawMonitor and when is was used by IsSameObject().
Unfortunately checking ownership means comparing jobjects, which you can't safely do if you are comparing to a jobject that could be freed at any moment
Okay but how does this dbgRawMonitor lock prevent the GC from clearing a jobject?
Okay but how does this
dbgRawMonitorlock prevent the GC from clearing a jobject?
The JNI ref. It's actually a global ref (I said local ref earlier). As long as any setters of DebugRawMonitor::ownerThread (setting the thread or clearing it) grab dbgRawMonitor first, and the current thread accessing ownerThread does the same, then you know that whatever ownerThread points to is either NULL, the current thread, or some other thread that is still alive and is actually holding the monitor that the DebugRawMonitor is encapsulating. As long as the current thread continues to hold dbgRawMonitor, no other thread will be able to change ownerThread. This means that if some other thread currently owns the monitor that the DebugRawMonitor is encapsulating, then it will first block on dbgRawMonitor before trying to release it and clear ownerThread.
@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
keep alive
@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Keep alive. I'll be continuing work on this soon.
@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@plummercj This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.
/open
@plummercj This pull request is now open