mat 17X dump parsing performance regression since commit 35b3ff7

@jasonk000,

Last week, we rolled out a new build of MAT from the latest commit to IBM support servers and we're getting reports of very slow parses up to 9 hours. The issue is reproducible and the majority of time is spent in:

Subtask: Removing unreachable objects
Marking reachable objects

Doing some basic profiling with thread dumps every 30 seconds during this step shows a flame graph with most samples in ObjectMarker.countMarked (the percentage on each frame is of all threads in each thread dump though only the Worker-X threads are relevant; the key point is that across 30 thread dumps, almost all Worker threads were in ObjectMarker.countMarked):

org/eclipse/mat/parser/internal/snapshot/ObjectMarker.countMarked (29 samples, 1.83%)
org/eclipse/mat/parser/internal/snapshot/ObjectMarker.markSingleThreaded (30 samples, 1.89%)
org/eclipse/mat/parser/internal/GarbageCleaner.markUnreachableAsGCRoots (30 samples, 1.89%)
org/eclipse/mat/parser/internal/GarbageCleaner.clean (30 samples, 1.89%)
org/eclipse/mat/parser/internal/SnapshotFactoryImpl.parse (30 samples, 1.89%)
org/eclipse/mat/parser/internal/SnapshotFactoryImpl.openSnapshot (30 samples, 1.89%)
org/eclipse/mat/snapshot/SnapshotFactory.openSnapshot (30 samples, 1.89%)
org/eclipse/mat/ui/snapshot/ParseHeapDumpJob.run (30 samples, 1.89%)
org/eclipse/core/internal/jobs/Worker.run (30 samples, 1.89%)
Worker-X (30 samples, 1.89%)

Taking a dump of MAT itself during this step shows that the boolean[] bits array has about 77 million entries. I guess it's just incredibly slow to iterate over so many elements in our support server environment. The support servers are virtualized Windows servers. The boxes don't appear to be particularly loaded. 2 CPU sockets and 8 virtual processors, about 25% utilized. 256GB RAM.

I then uploaded a test build of MAT from commit 35b3ff7 and re-parsed from scratch, and the parse completed in about 30 minutes.

I'll be reverting our MAT builds to that commit but let me know if you want me to run any debug builds as I can upload custom builds and I can re-try on this dump.

May 28 '24 19:05 kgibm

(1) I'd be interested to see if you can run a build with just this reverted?

https://github.com/eclipse-mat/mat/commit/7c2cb094fc6119999e14e1d8a04fe15c37c31f34

(2) I wonder if you can run a job with async-profiler and capture a full wall-time and CPU flamegraph of the run? 9 hours is an extraordinarily long time. 77 million boolean[] is 77MB, scanning 77MB to count 1 vs 0 should not be a 9 hour job! A flamegraph and/or output with line numbers would be perfect.

I can see this loop, maybe we're seeing some particular behavior triggered here. https://github.com/eclipse-mat/mat/blob/master/plugins/org.eclipse.mat.parser/src/org/eclipse/mat/parser/internal/GarbageCleaner.java#L917-L938

(3) Is there any chance you can share a sample hprof?

May 28 '24 19:05 jasonk000

Thanks. Sure, I'll get as much of that for you as I can, probably tomorrow. I only took thread dumps for a subset of the whole parse, so maybe ObjectMarker.countMarked is only part of the regression. Unfortunately, I can't download anything from these systems, so I can't share any dumps from it, but I can run debug logging and extract out snippets if we need to go down that path. I'll look into async-profiler...

May 28 '24 20:05 kgibm

Also, timestamps of the log parser log would be good, if you can capture them

May 28 '24 20:05 jasonk000

Just a note that this might take me some time because currently our internal build of MAT has some patches which requires running with IBM Semeru Runtimes and async-profiler doesn't support the OpenJ9 JVM. For some time, I've wanted to make our internal build not require OpenJ9 but that will take a bit of work...

However, I will also try running without commit 7c2cb09 and get back to you on that...

May 29 '24 13:05 kgibm

I checked out e06518880d7b3d192bd3d6c1d53fb4734d80bed2 and then applied 6d27b35d936a62c546a96ea465e61930f8fcaf36 and 4842f05f2afe384905d289cf1d724d2ed773c655 and re-parsed the dump from scratch and it completed quickly, so it does seem that 7c2cb094fc6119999e14e1d8a04fe15c37c31f34 is the proximate cause of the slowdown.

May 29 '24 19:05 kgibm

Got it. I think a profile would help. I think a profile would be very helpful. Sending across a hprof (redacted) might be helpful too, MAT has the export redacted=FULL option which might be a viable reproduction?

Jun 01 '24 16:06 jasonk000

I've been very busy with other work but finally getting back around to this. It turns out IBM Semeru Runtimes does work with async-profiler (despite what it says in the README); however, async-profiler is not available on Windows which is where this issue is occurring. I've added in the IBM Health Center sampling profiler (similar to JFR) and I'm parsing the dump now.

Sending across a hprof (redacted) might be helpful too, MAT has the export redacted=FULL option which might be a viable reproduction?

Unfortunately all downloading is disabled in these environments. We can only copy/paste small text output or images. I should be able to get a flamegraph from the HealthCenter data. By default, it takes thread dumps every 30 seconds.

Jun 12 '24 19:06 kgibm

I reproduced the 15 hour parse with the profiler. For the CPU samples, ~93% are in ObjectMarker.countMarked:

Samples	Self (%)	Method
2087851	93.26		org.eclipse.mat.parser.internal.snapshot.ObjectMarker.countMarked()

Invocation paths:

org.eclipse.mat.parser.internal.snapshot.ObjectMarker.countMarked
  |-org.eclipse.mat.parser.internal.snapshot.ObjectMarker.markSingleThreaded (100%)
  |  |-org.eclipse.mat.parser.internal.GarbageCleaner.markUnreachableAsGCRoots (100%)
  |  |  |-org.eclipse.mat.parser.internal.GarbageCleaner.clean (100%)
  |  |  |  |-org.eclipse.mat.parser.internal.SnapshotFactoryImpl.parse (100%)
  |  |  |  |  |-org.eclipse.mat.parser.internal.SnapshotFactoryImpl.openSnapshot (100%)
  |  |  |  |  |  |-org.eclipse.mat.snapshot.SnapshotFactory.openSnapshot (100%)
  |  |  |  |  |  |  |-org.eclipse.mat.ui.snapshot.ParseHeapDumpJob.run (100%)
  |  |  |  |  |  |  |  |-org.eclipse.core.internal.jobs.Worker.run (100%)

For the 30 second thread dumps, here's the flame graph which shows a similar thing:

flamegraph

Jun 13 '24 13:06 kgibm

Attaching an interactive SVG to the GitHub issue seems to strip/restrict the SVG capabilities somehow, but it looks to work when uploaded to a repository: https://raw.githubusercontent.com/kgibm/ExampleFlameGraphs/main/flamegraph.svg

Jun 13 '24 13:06 kgibm

Hmm, no, the raw.githubusercontent.com is more interactive but not fully interactive. You must download the file locally and then open it in your browser.

Jun 13 '24 13:06 kgibm

I'm interested to know if this is resolved in #63, though I suspect not.

Aug 11 '24 06:08 jasonk000

@jasonk000 Thanks. I've been busy and I'm also in the process of some required upgrades to our internal tooling, but I hope to test out #63 soon.

Sep 17 '24 19:09 kgibm

I finally got back around to this. The original PHD file was no longer available but I found another large one. Unfortunately, I could not reproduce the original issue when running on the suspect commits. I guess I'll just roll out the latest commits and ask our support teams to keep a watch for suspiciously long parse times and I'll re-open if needed.

Dec 04 '24 20:12 kgibm

@jasonk000 We finally rolled out the latest MAT to IBM Support servers and some dumps are showing slowdowns in the same magnitude (~17X). For example, one PHD dump takes about 2 hours to load in our old MAT build and the same dump takes more than 1.5 days (unfinished) spending excessive time in "Marking reachable objects". Here's a flame graph from thread dumps taken every 30 seconds for a few hours during this time:

The biggest path is at org.eclipse.mat.parser.internal.snapshot.ObjectMarker.countMarked

Apr 14 '25 19:04 kgibm

Opened PR https://github.com/eclipse-mat/mat/pull/118 as a workaround pending resolution of this

Apr 16 '25 15:04 kgibm

As I was creating #118, I noticed comments from Andrew that might be relevant:

The Fork/join object marker is simpler than the old way, but the old way had some performance tuning, even though sometimes threads were idle.

The old way:

Depth first search thread - with locality.

Have a local stack for objects close to the current object.

Have a local queue for remaining objects.

Use the global stack for excess objects or when local stack and queue are empty.

so one question is how well the new marking works with limited memory (insufficient to hold most of the outbound references in memory). For example the CI builds are memory constrained because of the VMs used for Jenkins, fortunately the test dumps are small. The new thread local cache could help, but was the old system of locality in marking useful?

A minor improvement for fork/join might be to split very large object arrays - for example if there is an object array with 1,000,000,000 elements can only one thread process it, or can it be split?

I think the IndexWriter getPage SoftReference also has the possible problem where the SoftReference is cleared between two gets.

[...]

The locality I was considering was also reader caching for the outbound references.

I'm not sure how to check whether our support servers have sufficient heap to hold most outbound references in memory. We're running with -Xmx32GB by default. Though I would guess that if there was any paging to/from the index files, we'd see that in the flame graphs.

Apr 16 '25 15:04 kgibm

Commit https://github.com/eclipse-mat/mat/commit/061788e957bb68d1992a15be626d377c9b89203f disables the new ObjectMarker code by default pending further investigation of this issue, though it can be explicitly enabled with -Dorg.eclipse.mat.parser.internal.snapshot.ObjectMarkerFactory.useOldMarker=false

Apr 16 '25 16:04 kgibm