ignite icon indicating copy to clipboard operation
ignite copied to clipboard

Out of direct memory where flush wal buffer to disk in flush method, please ensure this.

Open shuangquansq opened this issue 1 year ago • 2 comments

Hi: Mybe this code has a bug: memory leak. Need add the line tlb.remove() to the file 469 line . file: org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle#flush(org.apache.ignite.internal.pagemem.wal.record.WALRecord, boolean)

   try {
        // We can safely allow other threads to start building next chains while we are doing flush here.
        ByteBuffer buf;

        boolean tmpBuf = false;

        if (expHead.chainSize() > tlbSize) {
            buf = GridUnsafe.allocateBuffer(expHead.chainSize());

            tmpBuf = true; // We need to manually release this temporary direct buffer.
        }
        else
            buf = tlb.get();

        try {
            long pos = fillBuffer(buf, expHead);

            writeBuffer(pos, buf);
        }
        finally {
            if (tmpBuf)
                GridUnsafe.freeBuffer(buf);
            // need add this code here **_tlb.remove();_**
        }

        return true;
    }
    catch (Throwable e) {
        StorageException se = e instanceof StorageException ? (StorageException)e :
            new StorageException("Unable to write", new IOException(e));

        cctx.kernalContext().failure().process(new FailureContext(CRITICAL_ERROR, se));

        // All workers waiting for a next segment must be woken up and stopped
        signalNextAvailable();

        throw se;
    }

Because we often encounter OutOfMemoryError: Cannot reserve 131072 bytes...... The detail error is below:

2024-06-01T00:20:07,454Z+0000|WARN|sys-stripe-2-#3|Log4J2Logger#warning:523|Still waiting for a concurrent write to complete [written=165737, pos=1448132, lastFsyncPos=165737, stop=false, actualPos=165737]
2024-06-01T00:20:08,259Z+0000|INFO|qtp1471464273-162|SharedClient#testConnection:41|user is added [[42]]
2024-06-01T00:20:08,964Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Unable to write]]
org.apache.ignite.internal.processors.cache.persistence.StorageException: Unable to write
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.flush(FsyncFileWriteHandle.java:475) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.addRecord(FsyncFileWriteHandle.java:267) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:968) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:921) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.beforeReleaseWrite(PageMemoryImpl.java:1867) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1698) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:523) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:515) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeUnlock(PageHandler.java:416) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:276) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:247) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.put(PagesList.java:933) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList$PutBucket.run(PagesList.java:199) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList$PutBucket.run(PagesList.java:173) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:313) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:304) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.flushBucketsCache(PagesList.java:415) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.saveMetadata(PagesList.java:356) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:322) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.lambda$syncMetadata$1(GridCacheOffheapManager.java:299) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.util.IgniteUtils.lambda$wrapIgniteFuture$5(IgniteUtils.java:11930) ~[ignite-core-2.16.0.jar:2.16.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.io.IOException: java.lang.OutOfMemoryError: Cannot reserve 131072 bytes of direct buffer memory (allocated: 3221191584, limit: 3221225472)
	... 24 more
Caused by: java.lang.OutOfMemoryError: Cannot reserve 131072 bytes of direct buffer memory (allocated: 3221191584, limit: 3221225472)
	at java.nio.Bits.reserveMemory(Unknown Source) ~[?:?]
	at java.nio.DirectByteBuffer.<init>(Unknown Source) ~[?:?]
	at java.nio.ByteBuffer.allocateDirect(Unknown Source) ~[?:?]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle$1.initialValue(FsyncFileWriteHandle.java:133) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle$1.initialValue(FsyncFileWriteHandle.java:131) ~[ignite-core-2.16.0.jar:2.16.0]
	at java.lang.ThreadLocal.setInitialValue(Unknown Source) ~[?:?]
	at java.lang.ThreadLocal.get(Unknown Source) ~[?:?]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.flush(FsyncFileWriteHandle.java:460) ~[ignite-core-2.16.0.jar:2.16.0]
	... 23 more
2024-06-01T00:20:08,965Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|No deadlocked threads detected.
2024-06-01T00:20:09,918Z+0000|ERROR|ttl-cleanup-worker-#54|Log4J2Logger#error:533|Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#8, blockedFor=11s]
2024-06-01T00:20:09,918Z+0000|WARN|jvm-pause-detector-worker|Log4J2Logger#warning:523|Possible too long JVM pause: 923 milliseconds.
2024-06-01T00:20:09,937Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|Thread dump at 2024/06/01 00:20:09 UTC
Thread [name="Reference Handler", id=2, state=RUNNABLE, blockCnt=6, waitCnt=0]
        at [email protected]/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
        at [email protected]/java.lang.ref.Reference.processPendingReferences(Unknown Source)
        at [email protected]/java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)

Thread [name="Finalizer", id=3, state=WAITING, blockCnt=6635, waitCnt=6636]
    Lock [object=java.lang.ref.ReferenceQueue$Lock@2967a826, ownerName=null, ownerId=-1]
        at [email protected]/java.lang.Object.wait(Native Method)
        at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
        at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
        at [email protected]/java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

Thread [name="Signal Dispatcher", id=4, state=RUNNABLE, blockCnt=0, waitCnt=0]

Thread [name="Common-Cleaner", id=11, state=TIMED_WAITING, blockCnt=276, waitCnt=5872]
    Lock [object=java.lang.ref.ReferenceQueue$Lock@76e66cc0, ownerName=null, ownerId=-1]
        at [email protected]/java.lang.Object.wait(Native Method)
        at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
        at [email protected]/jdk.internal.ref.CleanerImpl.run(Unknown Source)
        at [email protected]/java.lang.Thread.run(Unknown Source)
        at [email protected]/jdk.internal.misc.InnocuousThread.run(Unknown Source)

Thread [name="Notification Thread", id=12, state=RUNNABLE, blockCnt=0, waitCnt=0]

Thread [name="Apollo-RemoteConfigRepository-1", id=18, state=TIMED_WAITING, blockCnt=0, waitCnt=1124]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2d9b9311, ownerName=null, ownerId=-1]
        at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
        at [email protected]/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
        at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
        at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
        at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at [email protected]/java.lang.Thread.run(Unknown Source)
......

shuangquansq avatar Jun 06 '24 06:06 shuangquansq

Hello @shuangquansq ,

I can't tell if this is a mistake or not. However, the common recommendation is:

If you use Ignite native persistence, we recommend that you set the MaxDirectMemorySize JVM parameter to walSegmentSize * 4. With the default WAL settings, this value is equal to 256MB.

https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning#generic-gc-settings

Could you please share what value is used for walSegmentSize in your case?

sk0x50 avatar Jun 17 '24 07:06 sk0x50

our walSegmentSize is : 64MB. defaultDataRegionConfiguration: 1.2G -XX:MaxDirectMemorySize=1.8G system region: 100M checkpoint: 256M

I try Adjust MaxDirectMemorySize to 3G,the issue haven't solve.

shuangquansq avatar Jun 21 '24 02:06 shuangquansq