mat icon indicating copy to clipboard operation
mat copied to clipboard

Performance degradation in 1.16.1 - GarbageCleaner.markUnreachableAsGCRoots

Open sadaaithal opened this issue 5 months ago • 8 comments

I have a 7g heap from one of our customers. I suspect there's a lot of unreachable objects in it, because normal MAT parsing shows the heap occupancy at ~120mb. My colleage was able to process this heap on his Mac with the "keep unreachable objects" check box selected.

For some reason, for me it is stuck for a long time on both Windows and Linux. The stack is consistent

"Worker-0: Parsing heap dump from 'C:\Users\saithal\Downloads\heap_afterSuccessStreamingRuns_allObj\heap_afterSuccessStreamingRuns_allObj.hprof'" #36 prio=5 os_prio=0 cpu=2207171.88ms elapsed=3016.26s tid=0x000001a572ef5570 nid=0x15c3c runnable  [0x00000006726fd000]
   java.lang.Thread.State: RUNNABLE
        at org.eclipse.mat.parser.internal.snapshot.ObjectMarker.countMarked(ObjectMarker.java:177)
        at org.eclipse.mat.parser.internal.snapshot.ObjectMarker.markSingleThreaded(ObjectMarker.java:148)
        at org.eclipse.mat.parser.internal.GarbageCleaner.markUnreachableAsGCRoots(GarbageCleaner.java:935)
        at org.eclipse.mat.parser.internal.GarbageCleaner.clean(GarbageCleaner.java:160)
        at org.eclipse.mat.parser.internal.SnapshotFactoryImpl.parse(SnapshotFactoryImpl.java:466)
        at org.eclipse.mat.parser.internal.SnapshotFactoryImpl.openSnapshot(SnapshotFactoryImpl.java:224)
        at org.eclipse.mat.snapshot.SnapshotFactory.openSnapshot(SnapshotFactory.java:149)
        at org.eclipse.mat.ui.snapshot.ParseHeapDumpJob.run(ParseHeapDumpJob.java:99)
        at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)

OR linux

Worker-1: Parsing heap dump from '/home/saithal/cases/dbmi-oom/heap_afterSuccessStreamingRuns_allObj.hprof'" #178 prio=5 os_prio=0 cpu=284351.72ms elapsed=403.30s tid=0x00007f11b2f65790 nid=0x2701a4 runnable  [0x00007ef1d2ec7000]
   java.lang.Thread.State: RUNNABLE
        at org.eclipse.mat.parser.internal.snapshot.ObjectMarker.countMarked(ObjectMarker.java:177)
        at org.eclipse.mat.parser.internal.snapshot.ObjectMarker.markSingleThreaded(ObjectMarker.java:139)
        at org.eclipse.mat.parser.internal.GarbageCleaner.markUnreachableAsGCRoots(GarbageCleaner.java:935)
        at org.eclipse.mat.parser.internal.GarbageCleaner.clean(GarbageCleaner.java:160)
        at org.eclipse.mat.parser.internal.SnapshotFactoryImpl.parse(SnapshotFactoryImpl.java:466)
        at org.eclipse.mat.parser.internal.SnapshotFactoryImpl.openSnapshot(SnapshotFactoryImpl.java:224)
        at org.eclipse.mat.snapshot.SnapshotFactory.openSnapshot(SnapshotFactory.java:149)
        at org.eclipse.mat.ui.snapshot.ParseHeapDumpJob.run(ParseHeapDumpJob.java:99)
        at org.eclipse.core.internal.jobs.Worker.run(Worker.java:63)

My linux box is server class - 128cpus and >500gb RAM

[saithal@RHEL84 dbmi-oom]$ free -g
              total        used        free      shared  buff/cache   available
Mem:            754         175         373           0         205         573
Swap:            29           0          29

[saithal@RHEL84  dbmi-oom]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
...

I can provide a sanitized version of this heap for a review but im afraid that might wipe out key info that causes this issue in the first place. I am trying to understand why the ObjectMarker switches to single threaded. (going by what I see on the thread stack)

I suppose I can check the useOldMarker from 32539da. Any other ideas?

sadaaithal avatar Aug 29 '25 23:08 sadaaithal

I've been unable to reproduce this, so it would be great if you can share a sanitized heap. You could experiment with sanitizing it and then see if it's reproducible, then hopefully we can investigate.

It's a bit puzzling because countMarked should be one of the least likely to get blocked, and yet it shows up consistently.

In the meantime, yes, you could look at using the old marker. This has since been reverted to default to using the older marker in https://github.com/eclipse-mat/mat/pull/118/commits/85ddcbc03896ab46d0d691551282a439b09186f0.

jasonk000 avatar Aug 30 '25 19:08 jasonk000

linking #57 and #118

jasonk000 avatar Aug 30 '25 20:08 jasonk000

@jasonk000

If the timestamps are reliable, then i'd say the parsing of this heap took >30 hours

[saithal@RHEL84 .metadata]$ tail -f .log
Command-line arguments:  -os linux -ws gtk -arch x86_64

!ENTRY org.eclipse.mat.ui 1 0 2025-08-29 19:30:02.639
!MESSAGE Detected compressed references, because with uncompressed 64-bit references the array at 0x600400cb0 would overlap the array at 0x600400c88

!ENTRY org.eclipse.mat.ui 1 0 2025-08-29 19:30:14.540
!MESSAGE Wrote threads call stacks to /home/saithal/cases/dbmi-oom/heap_afterSuccessStreamingRuns_allObj.threads

!ENTRY org.eclipse.mat.ui 1 0 2025-08-29 19:30:18.177
!MESSAGE Heap /home/saithal/cases/dbmi-oom/heap_afterSuccessStreamingRuns_allObj.hprof contains 104,172,865 objects

MAT reports the total object count at 2025-08-29 19:30:18.177

When I checked the time the indices were created, that is Aug 31, 03:00 (for comparison, the .threads file was created at Aug 29 19:30 )

Something is definitely wrong.

[saithal@RHEL84 dbmi-oom]$ ls -lrt
total 14895112
-rw-r--r-- 1 saithal support 7628174225 Aug 29 19:06 heap_afterSuccessStreamingRuns_allObj.hprof
-rw-r--r-- 1 saithal support     129104 Aug 29 19:12 ssd
-rw-r--r-- 1 saithal support     552011 Aug 29 19:30 heap_afterSuccessStreamingRuns_allObj.threads
-rw-r--r-- 1 saithal support  794320202 Aug 31 03:00 heap_afterSuccessStreamingRuns_allObj.idx.index
-rw-r--r-- 1 saithal support  351584486 Aug 31 03:00 heap_afterSuccessStreamingRuns_allObj.o2c.index
-rw-r--r-- 1 saithal support  189868405 Aug 31 03:00 heap_afterSuccessStreamingRuns_allObj.a2s.index
-rw-r--r-- 1 saithal support 1380746253 Aug 31 03:02 heap_afterSuccessStreamingRuns_allObj.inbound.index
-rw-r--r-- 1 saithal support 1393746253 Aug 31 03:02 heap_afterSuccessStreamingRuns_allObj.outbound.index
-rw-r--r-- 1 saithal support  826635026 Aug 31 03:02 heap_afterSuccessStreamingRuns_allObj.o2hprof.index
-rw-r--r-- 1 saithal support  634919046 Aug 31 03:02 heap_afterSuccessStreamingRuns_allObj.index
-rw-r--r-- 1 saithal support  342834486 Aug 31 03:03 heap_afterSuccessStreamingRuns_allObj.domIn.index
-rw-r--r-- 1 saithal support  612166621 Aug 31 03:04 heap_afterSuccessStreamingRuns_allObj.o2ret.index
-rw-r--r-- 1 saithal support 1036900054 Aug 31 03:04 heap_afterSuccessStreamingRuns_allObj.domOut.index
-rw-r--r-- 1 saithal support     106275 Sep  2 16:58 heap_afterSuccessStreamingRuns_allObj_Leak_Suspects.zip
drwxr-xr-x 2 saithal support       4096 Sep  2 17:00 redacted

I have now generated a redacted heap.

Property Name             |Property Value
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Used heap dump            |5.3 GB
Number of objects         |104,172,865
Number of classes         |33,713
Number of class loaders   |1,008
Number of GC roots        |12,212,559
Format                    |hprof
JVM version               |openjdk version "17.0.15" 2025-04-15
OpenJDK Runtime Environment (17.0.15+6-LTS)
OpenJDK 64-Bit Server VM (build 17.0.15+6-LTS, mixed mode, sharing)
Time                      |5:00:02PM EDT
Date                      |Sep 2, 2025
Identifier size           |64-bit
Compressed object pointers|true
File path                 |/home/saithal/cases/dbmi-oom/redacted/heap_afterSuccessStreamingRuns.bin
File length               |7,709,736,697
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I will report back on the following tests

  1. sanity test of what the redacted heap actually contains (this is my first time and I want to be sure there isnt anything here thats not supposed to)
  2. parsing the redacted heap with the "Keep unreachable objects" turned ON
  3. parsing the original heap with the "Keep unreachable objects" turned OFF

sadaaithal avatar Sep 02 '25 21:09 sadaaithal

  1. sanity test of what the redacted heap actually contains (this is my first time and I want to be sure there isnt anything here thats not supposed to)  
    

DONE - heap looks ok and string data is wiped

  1. parsing the redacted heap with the "Keep unreachable objects" turned ON
    

--- Something strange is going on. I ran test 3 before this, and for some reason even after turning this option checked, I see MAT reporting that its removing unreachable objects. Ive restarted MAT a couple of times now and still the same

Image Image
  1. parsing the original heap with the "Keep unreachable objects" turned OFF
    

DONE - parsing took ~5m

  1. parsing the redacted heap with the "Keep unreachable objects" turned OFF
    

DONE - parsing took ~5m

Redacted heapdump is ~1g. @jasonk000 do you have a suggestion on how I can provide this file (cant upload this to the ticket here) Thanks!

sadaaithal avatar Sep 02 '25 22:09 sadaaithal

I can confirm that even MAT on windows is doing the same - it reports that its removing unreachable even with the option turned on. I suppose something in the the redacted file/snapshot is causing MAT to go down this path.

Image

sadaaithal avatar Sep 02 '25 22:09 sadaaithal

I was able to parse the original heap with unreachable objects flag checked in ~5m using the latest stable build

eclipse.buildId=unknown
java.version=21.0.4
java.vendor=Azul Systems, Inc.
BootLoader constants: OS=linux, ARCH=x86_64, WS=gtk, NL=en_US
Command-line arguments:  -os linux -ws gtk -arch x86_64

!ENTRY org.eclipse.mat.ui 1 0 2025-09-02 19:09:29.285
!MESSAGE Started opening dump /home/saithal/cases/dbmi-oom/heap_afterSuccessStreamingRuns_allObj.hprof at 2025-09-02T19:09:29.277995111-04:00[America/New_York]

!ENTRY org.eclipse.mat.ui 1 0 2025-09-02 19:09:29.544
!MESSAGE Detected compressed references, because with uncompressed 64-bit references the array at 0x600400cb0 would overlap the array at 0x600400c88

!ENTRY org.eclipse.mat.ui 1 0 2025-09-02 19:09:34.391
!MESSAGE Wrote threads call stacks to /home/saithal/cases/dbmi-oom/heap_afterSuccessStreamingRuns_allObj.threads

!ENTRY org.eclipse.mat.ui 1 0 2025-09-02 19:09:36.129
!MESSAGE Heap /home/saithal/cases/dbmi-oom/heap_afterSuccessStreamingRuns_allObj.hprof contains 104,169,504 objects

!ENTRY org.eclipse.mat.ui 1 0 2025-09-02 19:11:02.290
!MESSAGE The keep unreachable objects condition is set to true. This may come either from the preference or dump condition requirements.

!ENTRY org.eclipse.mat.ui 1 0 2025-09-02 19:14:49.297
!MESSAGE Finished opening dump in 320.01 seconds (5m 20.01591431s) at 2025-09-02T19:14:49.293909421-04:00[America/New_York]

sadaaithal avatar Sep 02 '25 23:09 sadaaithal

hi @sadaaithal - thanks for the updates. I admit I'm finding it difficult to follow your notes, it looks like you are making progress though.

(1) does the redacted hprof show the same slow-parse symptoms? If so then maybe I can use it to investigate the object marker code.

If you like, you can send an email to jkoch@ my company name dot com (you can see my company in my Github profile) and I will organise a dropbox to upload the hprof

(2) does the newest build, including https://github.com/eclipse-mat/mat/commit/85ddcbc03896ab46d0d691551282a439b09186f0 work around the issue for you, are you using nightly?

Thanks! Much appreciate your help in investigating.

jasonk000 avatar Sep 03 '25 00:09 jasonk000

@jasonk000 Sorry about that. Let me summarize

Heap MAT version Time taken to parse Comments
original 1.16.1 >30 hours
redacted 1.16.1 ~5 minutes MAT removed unreachable objects even with "keep unreachable objects" option enabled
original 1.17.1 ~5 minutes
redacted 1.17.1 ~5 minutes MAT removed unreachable objects even with "keep unreachable objects" option enabled

Redacted heap doesn't show the slow parsing symptoms. But if you still need that, let me know. In 1.17.1 even the original heap parses in ~5 minutes keeping unreachable objects intact. So it looks like the default old marker code does help.

Hope this helps

sadaaithal avatar Sep 03 '25 17:09 sadaaithal