We run out of direct memory when the WAL buffer is flushed to disk in the flush method; please confirm this.
Hi, I think this code may have a bug: a memory leak. The line tlb.remove() should be added at line 469 of the file: org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle#flush(org.apache.ignite.internal.pagemem.wal.record.WALRecord, boolean)
try {
    // We can safely allow other threads to start building next chains while we are doing flush here.
    ByteBuffer buf;

    boolean tmpBuf = false;

    if (expHead.chainSize() > tlbSize) {
        buf = GridUnsafe.allocateBuffer(expHead.chainSize());

        tmpBuf = true; // We need to manually release this temporary direct buffer.
    }
    else
        buf = tlb.get();

    try {
        long pos = fillBuffer(buf, expHead);

        writeBuffer(pos, buf);
    }
    finally {
        if (tmpBuf)
            GridUnsafe.freeBuffer(buf);

        // Proposed fix: add tlb.remove() here to release the thread-local direct buffer.
    }

    return true;
}
catch (Throwable e) {
    StorageException se = e instanceof StorageException ? (StorageException)e :
        new StorageException("Unable to write", new IOException(e));

    cctx.kernalContext().failure().process(new FailureContext(CRITICAL_ERROR, se));

    // All workers waiting for a next segment must be woken up and stopped
    signalNextAvailable();

    throw se;
}
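For reference, here is a minimal, self-contained sketch of the thread-local direct-buffer pattern that tlb appears to implement, and where the proposed tlb.remove() would go. The class, method, and constant names (ThreadLocalBufferSketch, flushOnce, TLB_SIZE) are made up for illustration and are not Ignite code; only ThreadLocal and ByteBuffer.allocateDirect are JDK APIs.

```java
import java.nio.ByteBuffer;

// Minimal sketch (illustrative names, not Ignite source) of a per-thread direct buffer
// that is only reclaimable once nothing references it any more.
public class ThreadLocalBufferSketch {
    // Hypothetical per-thread buffer size, e.g. the 131072 bytes (128 KB) seen in the OOM message.
    private static final int TLB_SIZE = 128 * 1024;

    // Every thread that calls flushOnce() lazily allocates its own direct buffer here.
    private static final ThreadLocal<ByteBuffer> tlb =
        ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(TLB_SIZE));

    static void flushOnce(byte[] payload) {
        ByteBuffer buf = tlb.get(); // reserves TLB_SIZE of direct memory on a thread's first flush

        try {
            buf.clear();
            buf.put(payload, 0, Math.min(payload.length, buf.capacity()));
            // ... write buf to the WAL segment ...
        }
        finally {
            // Without this, the buffer stays referenced for the lifetime of the thread,
            // so a large pool of threads can keep MaxDirectMemorySize fully reserved.
            // With it, the buffer becomes unreachable and its direct memory can be
            // released by a later GC, at the cost of re-allocating on the next flush.
            tlb.remove();
        }
    }
}
```

The trade-off is that every flush then re-allocates the per-thread buffer, so this trades retained direct memory for allocation churn; it illustrates the idea rather than being a drop-in patch.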
We often encounter OutOfMemoryError: Cannot reserve 131072 bytes... The detailed error is below:
2024-06-01T00:20:07,454Z+0000|WARN|sys-stripe-2-#3|Log4J2Logger#warning:523|Still waiting for a concurrent write to complete [written=165737, pos=1448132, lastFsyncPos=165737, stop=false, actualPos=165737]
2024-06-01T00:20:08,259Z+0000|INFO|qtp1471464273-162|SharedClient#testConnection:41|user is added [[42]]
2024-06-01T00:20:08,964Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Unable to write]]
org.apache.ignite.internal.processors.cache.persistence.StorageException: Unable to write
at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.flush(FsyncFileWriteHandle.java:475) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.addRecord(FsyncFileWriteHandle.java:267) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:968) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:921) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.beforeReleaseWrite(PageMemoryImpl.java:1867) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1698) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:523) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:515) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeUnlock(PageHandler.java:416) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:276) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:247) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.put(PagesList.java:933) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList$PutBucket.run(PagesList.java:199) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList$PutBucket.run(PagesList.java:173) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:313) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:304) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.flushBucketsCache(PagesList.java:415) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.saveMetadata(PagesList.java:356) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:322) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.lambda$syncMetadata$1(GridCacheOffheapManager.java:299) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.util.IgniteUtils.lambda$wrapIgniteFuture$5(IgniteUtils.java:11930) ~[ignite-core-2.16.0.jar:2.16.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.io.IOException: java.lang.OutOfMemoryError: Cannot reserve 131072 bytes of direct buffer memory (allocated: 3221191584, limit: 3221225472)
... 24 more
Caused by: java.lang.OutOfMemoryError: Cannot reserve 131072 bytes of direct buffer memory (allocated: 3221191584, limit: 3221225472)
at java.nio.Bits.reserveMemory(Unknown Source) ~[?:?]
at java.nio.DirectByteBuffer.<init>(Unknown Source) ~[?:?]
at java.nio.ByteBuffer.allocateDirect(Unknown Source) ~[?:?]
at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle$1.initialValue(FsyncFileWriteHandle.java:133) ~[ignite-core-2.16.0.jar:2.16.0]
at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle$1.initialValue(FsyncFileWriteHandle.java:131) ~[ignite-core-2.16.0.jar:2.16.0]
at java.lang.ThreadLocal.setInitialValue(Unknown Source) ~[?:?]
at java.lang.ThreadLocal.get(Unknown Source) ~[?:?]
at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.flush(FsyncFileWriteHandle.java:460) ~[ignite-core-2.16.0.jar:2.16.0]
... 23 more
2024-06-01T00:20:08,965Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|No deadlocked threads detected.
2024-06-01T00:20:09,918Z+0000|ERROR|ttl-cleanup-worker-#54|Log4J2Logger#error:533|Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#8, blockedFor=11s]
2024-06-01T00:20:09,918Z+0000|WARN|jvm-pause-detector-worker|Log4J2Logger#warning:523|Possible too long JVM pause: 923 milliseconds.
2024-06-01T00:20:09,937Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|Thread dump at 2024/06/01 00:20:09 UTC
Thread [name="Reference Handler", id=2, state=RUNNABLE, blockCnt=6, waitCnt=0]
at [email protected]/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
at [email protected]/java.lang.ref.Reference.processPendingReferences(Unknown Source)
at [email protected]/java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
Thread [name="Finalizer", id=3, state=WAITING, blockCnt=6635, waitCnt=6636]
Lock [object=java.lang.ref.ReferenceQueue$Lock@2967a826, ownerName=null, ownerId=-1]
at [email protected]/java.lang.Object.wait(Native Method)
at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
at [email protected]/java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
Thread [name="Signal Dispatcher", id=4, state=RUNNABLE, blockCnt=0, waitCnt=0]
Thread [name="Common-Cleaner", id=11, state=TIMED_WAITING, blockCnt=276, waitCnt=5872]
Lock [object=java.lang.ref.ReferenceQueue$Lock@76e66cc0, ownerName=null, ownerId=-1]
at [email protected]/java.lang.Object.wait(Native Method)
at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
at [email protected]/jdk.internal.ref.CleanerImpl.run(Unknown Source)
at [email protected]/java.lang.Thread.run(Unknown Source)
at [email protected]/jdk.internal.misc.InnocuousThread.run(Unknown Source)
Thread [name="Notification Thread", id=12, state=RUNNABLE, blockCnt=0, waitCnt=0]
Thread [name="Apollo-RemoteConfigRepository-1", id=18, state=TIMED_WAITING, blockCnt=0, waitCnt=1124]
Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2d9b9311, ownerName=null, ownerId=-1]
at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
at [email protected]/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at [email protected]/java.lang.Thread.run(Unknown Source)
......
Hello @shuangquansq,
I can't tell if this is a mistake or not. However, the common recommendation is:
If you use Ignite native persistence, we recommend that you set the MaxDirectMemorySize JVM parameter to walSegmentSize * 4. With the default WAL settings, this value is equal to 256MB.
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning#generic-gc-settings
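As a rough illustration of that formula (assuming the Ignite 2.x configuration API, where DataStorageConfiguration.getWalSegmentSize() returns the WAL segment size in bytes and defaults to 64 MB), something like the following prints the value to pass as -XX:MaxDirectMemorySize:

```java
import org.apache.ignite.configuration.DataStorageConfiguration;

// Rough sketch: derive the recommended -XX:MaxDirectMemorySize from the configured
// WAL segment size (walSegmentSize * 4, per the memory-tuning guide linked above).
public class DirectMemoryHint {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        // dsCfg.setWalSegmentSize(64 * 1024 * 1024); // uncomment if you override the default

        long walSegmentSize = dsCfg.getWalSegmentSize(); // bytes; 64 MB if left at the default

        long recommended = walSegmentSize * 4L;

        System.out.printf("walSegmentSize=%d bytes -> -XX:MaxDirectMemorySize=%dm%n",
            walSegmentSize, recommended / (1024 * 1024));
    }
}
```

With the default 64 MB segments this prints 256m, matching the value quoted above.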
Could you please share what value is used for walSegmentSize in your case?
Our walSegmentSize is:
I tried adjusting MaxDirectMemorySize to 3G, but the issue is still not resolved.