jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8328866: Add raw monitor rank support to the debug agent.

Open plummercj opened this issue 1 year ago • 9 comments

This PR adds ranked monitor support to the debug agent. The debug agent has a large number of monitors, and it's really hard to know which order to grab them in, and for that matter which monitors might already be held at any given moment. By imposing a rank on each monitor, we can check to make sure they are always grabbed in the order of their rank. Having this in place when I was working on JDK-8324868 would have made it much easier to detect a deadlock that was occuring, and the reason for it. That's what motivated me to do this work

There were 2 or 3 minor rank issues discovered as a result of these changes. I also learned a lot about some of the more ugly details of the locking support in the process.

Tested with the following on all supported platforms and with virtual threads:

com/sun/jdi vmTestbase/nsk/jdi vmTestbase/nsk/jdb vmTestbase/nsk/jdwp

Still need to run tier2 and tier5.

Details of the changes follow in the first comment.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Warning

 ⚠️ Found trailing period in issue title for 8328866: Add raw monitor rank support to the debug agent.

Issue

  • JDK-8328866: Add raw monitor rank support to the debug agent. (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19044/head:pull/19044
$ git checkout pull/19044

Update a local copy of the PR:
$ git checkout pull/19044
$ git pull https://git.openjdk.org/jdk.git pull/19044/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19044

View PR using the GUI difftool:
$ git pr show -t 19044

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19044.diff

Webrev

Link to Webrev Comment

plummercj avatar May 01 '24 21:05 plummercj

:wave: Welcome back cjplummer! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar May 01 '24 21:05 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar May 01 '24 21:05 openjdk[bot]

@plummercj The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar May 01 '24 21:05 openjdk[bot]

I was able to minimize changes for users of RawMonitor by keeping the same client wrapper APIs (eg. debugMonitorEnter), and for the most part passing in the same argument (the monitor to enter). The only change is that the client is passing a new type called DebugRawMonitor* instead of a jrawMonitorID. The DebugRawMonitor encapsulates the jrawMonitorID and contains other monitor info that needs to be tracked (like the rank and current owner).

Ranks were determined largely by trial and error by running tests and fixing the rank violations asserts as I hit them. In the enum that contains all the rankings, I explain the rational for each monitor's rank.

For the moment I have included a flag (rankedmonitors) to control whether or not ranked monitors should be used. This is just a safety net in case someone runs into an issue. The flag is not documented and will likely be removed at some point.

The ranked monitor supported ended up needing a RawMonitor itself (dbgRawMonitor_lock). This RawMonitor is therefore not part of the ranked monitor support. It could probably use a better name. Suggestions are welcomed.

There are a few issues I fixed while working on the ranked monitor support:

During debug agent initialization we need to call util_initialize() before commonRef_initialize() because the later creates a monitor, and that depends on util_initialize() having already been called.

getLocks() in threadControl.c wasn't grabbing locks in the right order. This could have lead to a deadlock, but seems the structure of how locks are used prevented one from happening.

In invoker_completeInvokeRequest() I had to add grabbing the stepLock in order to maintain proper lock order when getLocks() was eventually called (which also grabbed stepLock). Other than triggering a rank violation assert, this doesn't seem to have caused any real issue, but did leave the potential for a deadlock.

In handleInterrupt() I noticed that a local JNI ref was not being freed. I noticed because there were other places where I had added calls to threadControl_currentThread() that resulted in a noticeable JNI ref leak. When I fixed those, I fixed the one in handleInterrupt() too even tough it was not causing any problems.

The ranked monitor APIs needed some extra safeguards during VM exit that are not needed for the unranked monitor APIs. This is because the unranked APIs didn't have a problem with the current thread being NULL, but the ranked APIs require a valid current thread. That is why you see the following check:

    if (gdata->vmDead && current_thread == NULL) {
        return;
    }

plummercj avatar May 01 '24 21:05 plummercj

Chris, I'm looking at this fix.

sspitsyn avatar May 07 '24 00:05 sspitsyn

I added a verifyMonitorRank() when doing a debugMonitorWait() since wait does an exit and enter, so we should do the same rank check that we do during an enter. I was a bit worried that we might be doing wait() while holding higher ranking lock, especially given how getLocks() pro-actively grabs locks before they are needed, but all tests (surprisingly) passed after this change.

plummercj avatar May 07 '24 00:05 plummercj

What do you think about the terminology usage of "rank". In this PR the term "higher ranked" means that it must be entered first (before any monitor with a lower rank), but 0 is the highest rank, and bigger numbers mean a lower rank. Sometimes when we rank things (in general) we say #1 is the best, but sometimes we instead rank with a score, and the highest score is the best. I'm wondering if for this PR the rank ordering should be reversed, so a "higher rank" is actually a higher number.

Update: Hotspot seems to use the term "higher/lowest rank" in the same way, but applies the lowest rank number to the lowest ranked mutex, so this would suggest that I should reverse the rank orders.

plummercj avatar May 07 '24 18:05 plummercj

What do you think about the terminology usage of "rank".

I slightly prefer to have the same rank order as Hotspot uses. At the same time, I'm not sure if it won't cause any issues/difficulties. But I'd give it a try.

sspitsyn avatar May 08 '24 15:05 sspitsyn

Note we can't check if the current thread owns a lock without grabbing dbgRawMonitor. In fact the main purpose of dbgRawMonitor is to allow the current thread to check if it owns the monitor.

That smells fishy to me. A thread can always safely check if it owns something because it can't race with itself and get a wrong answer.

dholmes-ora avatar May 20 '24 01:05 dholmes-ora

That smells fishy to me. A thread can always safely check if it owns something because it can't race with itself and get a wrong answer.

Unfortunately checking ownership means comparing jobjects, which you can't safely do if you are comparing to a jobject that could be freed at any moment (I'm referring the JNI ref being freed, not the object being GC'd). I ran into asserts in JNI IsSameObject() due to this happening. Basically the local ref became invalid between the time it was fetched from the DebugRawMonitor and when is was used by IsSameObject().

plummercj avatar May 20 '24 03:05 plummercj

Unfortunately checking ownership means comparing jobjects, which you can't safely do if you are comparing to a jobject that could be freed at any moment

Okay but how does this dbgRawMonitor lock prevent the GC from clearing a jobject?

dholmes-ora avatar May 20 '24 04:05 dholmes-ora

Okay but how does this dbgRawMonitor lock prevent the GC from clearing a jobject?

The JNI ref. It's actually a global ref (I said local ref earlier). As long as any setters of DebugRawMonitor::ownerThread (setting the thread or clearing it) grab dbgRawMonitor first, and the current thread accessing ownerThread does the same, then you know that whatever ownerThread points to is either NULL, the current thread, or some other thread that is still alive and is actually holding the monitor that the DebugRawMonitor is encapsulating. As long as the current thread continues to hold dbgRawMonitor, no other thread will be able to change ownerThread. This means that if some other thread currently owns the monitor that the DebugRawMonitor is encapsulating, then it will first block on dbgRawMonitor before trying to release it and clear ownerThread.

plummercj avatar May 20 '24 04:05 plummercj

@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jun 19 '24 22:06 bridgekeeper[bot]

keep alive

plummercj avatar Jun 24 '24 17:06 plummercj

@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jul 22 '24 19:07 bridgekeeper[bot]

Keep alive. I'll be continuing work on this soon.

plummercj avatar Jul 22 '24 19:07 plummercj

@plummercj This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Sep 14 '24 01:09 bridgekeeper[bot]

@plummercj This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

bridgekeeper[bot] avatar Oct 12 '24 02:10 bridgekeeper[bot]

/open

plummercj avatar Oct 12 '24 05:10 plummercj

@plummercj This pull request is now open

openjdk[bot] avatar Oct 12 '24 05:10 openjdk[bot]