
[2.17.1] AsyncLogger thread is stuck on 100% CPU on lmax-3.4 queue polling

Open shloim opened this issue 2 years ago • 9 comments

Description

A low-probability race condition causes the LMAX Disruptor consumer thread to loop indefinitely on Thread.yield.

Configuration

-DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.enable.threadlocals=true -Dlog4j2.enable.direct.encoders=true -XX:CompileCommand=dontinline,org.apache.logging.log4j.core.async.perftest.NoOpIdleStrategy::idle

Version: 2.17.1

Operating system: Centos 8

JDK: OpenJDK 64-Bit Server VM GraalVM LIBGRAAL 20.3.3 (build 11.0.12+5-jvmci-20.3-b06)

Logs

It would seem that the queue data structure becomes corrupted and causes the polling thread to loop infinitely with the following stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #18 daemon prio=5 os_prio=0 cpu=4404755.82ms elapsed=4733.22s tid=0x00007fed7ade9800 nid=0xcd runnable  [0x00007fed7bffd000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Thread.yield(java.base@11.0.12/Native Method)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.transferAfterCancelledWait(java.base@11.0.12/AbstractQueuedSynchronizer.java:1752)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.12/AbstractQueuedSynchronizer.java:2119)
        at com.lmax.disruptor.TimeoutBlockingWaitStrategy.waitFor(TimeoutBlockingWaitStrategy.java:38)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)

Reproduction

This was found on a large production system with very low probability, using Log4j with the above settings. We don't know the exact conditions under which the polling thread reaches this state.

shloim avatar Jan 07 '24 14:01 shloim

@shloim, the upcoming Log4j 2 release (i.e., 2.23.0) will be the very first Log4j 2 version that ships Disruptor 4 support. (Disruptor 3 will still work too.) Is that something you can try?

vy avatar Jan 08 '24 08:01 vy
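
For reference, trying that combination should only require bumping the dependency coordinates once 2.23.0 is out; the versions below are taken from later in this thread (Log4j 2.23.0, Disruptor 4.0.0) and are given as an illustration, not as maintainer guidance:

    org.apache.logging.log4j:log4j-core:2.23.0
    com.lmax:disruptor:4.0.0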

Alternatively, can you upgrade your Log4j version to 2.18.0 or later?

In 2.18.0 @remkop backported the TimeoutBlockingWaitStrategy from Disruptor 4.x with the following comment:

https://github.com/apache/logging-log4j2/blob/252f1a65e1b44fa78c622b4140d9116bf57d3617/log4j-core/src/main/java/org/apache/logging/log4j/core/async/TimeoutBlockingWaitStrategy.java#L34-L40

The difference in the locking mechanism seems relevant to your case.

ppkarwasz avatar Jan 08 '24 08:01 ppkarwasz

Do I need to set a custom WaitStrategy or can I keep the default behavior?

shloim avatar Jan 08 '24 13:01 shloim

@shloim,

"Our" TimeoutBlockingWaitStrategy should be the default one.

ppkarwasz avatar Jan 08 '24 14:01 ppkarwasz
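
In other words, no extra configuration should be needed. If you do want to pin the strategy explicitly, the async logger wait strategy can be selected with the same property used later in this thread (the documented values include Block, Timeout, Sleep and Yield); a minimal example:

    -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
    -DAsyncLogger.WaitStrategy=Timeout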

After switching to Log4j 2.18.0, the issue persists but with a different stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #20 daemon prio=5 os_prio=0 cpu=3228328.37ms elapsed=3495.56s tid=0x00007fea233e9000 nid=0xd0 runnable  [0x00007fea23dfd000]
   java.lang.Thread.State: RUNNABLE
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)

shloim avatar Jan 14 '24 12:01 shloim

@shloim, can you give 2.23.0-SNAPSHOT a try in combination with Disruptor 4? (See docs on how to configure Maven/Gradle/SBT for snapshot builds.)

vy avatar Jan 14 '24 19:01 vy
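
For anyone wanting to try the snapshot: Apache snapshot artifacts are not published to Maven Central, so the build needs the ASF snapshot repository added. A minimal sketch, assuming the usual ASF Nexus snapshot URL:

    Repository URL: https://repository.apache.org/content/repositories/snapshots/
    Dependency:     org.apache.logging.log4j:log4j-core:2.23.0-SNAPSHOT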

Hi, I'm seeing this with 2.23.1 and Disruptor 4.0.0 in my application.

It's happening when I use:

Log4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
AsyncLogger.WaitStrategy=Yield

"Log4j2-TF-1-AsyncLogger[AsyncContext@3b192d32]-1" - Thread t@14
   java.lang.Thread.State: RUNNABLE
	at java.base@11/java.lang.Thread.yield(Native Method)
	at app//com.lmax.disruptor.YieldingWaitStrategy.applyWaitMethod(YieldingWaitStrategy.java:58)
	at app//com.lmax.disruptor.YieldingWaitStrategy.waitFor(YieldingWaitStrategy.java:40)
	at app//com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
	at app//com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:156)
	at app//com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122)
	at java.base@11/java.lang.Thread.run(Thread.java:829)

   Locked ownable synchronizers:
	- None

Operating system: macOS 14.4.1

JDK: openjdk:11-jre

Overall, I'm thinking this may be by design of the YieldingWaitStrategy, which states:

This strategy will use 100% CPU, but will more readily give up the CPU than a busy spin strategy if other threads require CPU resource.

Perhaps the Log4j async manual could be clearer; it currently states:

Yield is a strategy that uses a Thread.yield() for waiting for log events after an initially spinning. Yield is a good compromise between performance and CPU resource, but may use more CPU than Sleep in order to get the message logged to disk sooner.

AndreKurait avatar May 04 '24 04:05 AndreKurait
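
For readers unfamiliar with the strategy, the behavior described above follows from the spin-then-yield pattern: the consumer never parks, it just keeps re-checking the published sequence and calls Thread.yield() once an initial spin budget is exhausted. A minimal, self-contained sketch of that pattern (illustrative only; class name and spin threshold are made up, and this is not the Disruptor source):

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a spin-then-yield wait loop, showing why a yield-based strategy
// keeps one core busy even when no events arrive.
public class YieldWaitSketch {

    private static final int SPIN_TRIES = 100; // hypothetical spin budget

    // Burns CPU until the published sequence reaches 'expected'. The thread
    // stays RUNNABLE the whole time, which matches the dumps in this issue.
    static long waitFor(long expected, AtomicLong publishedSequence) {
        int counter = SPIN_TRIES;
        long available;
        while ((available = publishedSequence.get()) < expected) {
            if (counter > 0) {
                counter--;          // phase 1: busy-spin
            } else {
                Thread.yield();     // phase 2: yield, but never park or sleep
            }
        }
        return available;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong published = new AtomicLong(-1);

        Thread consumer = new Thread(
                () -> System.out.println("consumed sequence " + waitFor(0, published)),
                "sketch-consumer");
        consumer.start();

        Thread.sleep(1000);   // consumer spins/yields at ~100% CPU here
        published.set(0);     // "publish" one event; the consumer exits
        consumer.join();
    }
}

With a Timeout or Sleep strategy the same wait eventually parks the thread, which is why those strategies trade a little latency for much lower idle CPU.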

Perhaps the Log4j async manual could be clearer

We can indeed do this. Updated the #2534 description accordingly.

vy avatar May 06 '24 09:05 vy

@shloim, any updates? Were you able to reproduce the issue with Log4j 2.23.1 using Disruptor 4?

vy avatar May 06 '24 09:05 vy

The documentation of Yield was fixed in #2577.

ppkarwasz avatar May 21 '24 20:05 ppkarwasz