[2.17.1] AsyncLogger thread is stuck at 100% CPU on lmax-3.4 queue polling
Description
A low-probability race condition causes the LMAX Disruptor consumer thread to loop indefinitely on Thread.yield().
Configuration
-DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.enable.threadlocals=true -Dlog4j2.enable.direct.encoders=true -XX:CompileCommand=dontinline,org.apache.logging.log4j.core.async.perftest.NoOpIdleStrategy::idle
Version: 2.17.1
Operating system: Centos 8
JDK: OpenJDK 64-Bit Server VM GraalVM LIBGRAAL 20.3.3 (build 11.0.12+5-jvmci-20.3-b06)
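For reference, the same switches can also be set programmatically; a minimal sketch (the property names are the standard Log4j 2 ones; the class name and log message are illustrative only, and the properties must be set before the first LogManager call):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class AsyncSetupSketch {
    public static void main(String[] args) {
        // Equivalent of the -D flags above; must run before Log4j initializes.
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");
        System.setProperty("log4j2.enable.threadlocals", "true");
        System.setProperty("log4j2.enable.direct.encoders", "true");

        Logger logger = LogManager.getLogger(AsyncSetupSketch.class);
        logger.info("async logger initialized");
    }
}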
Logs
It would seem that the queue data structure becomes corrupted and causes the polling thread to loop infinitely with the following stack trace:
"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #18 daemon prio=5 os_prio=0 cpu=4404755.82ms elapsed=4733.22s tid=0x00007fed7ade9800 nid=0xcd runnable [0x00007fed7bffd000]
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.yield(java.base@11.0.12/Native Method)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.transferAfterCancelledWait(java.base@11.0.12/AbstractQueuedSynchronizer.java:1752)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.12/AbstractQueuedSynchronizer.java:2119)
at com.lmax.disruptor.TimeoutBlockingWaitStrategy.waitFor(TimeoutBlockingWaitStrategy.java:38)
at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)
Reproduction
This was found on a large production system with very low probability, using Log4j with the above settings. We don't know the exact conditions under which the polling thread reaches this state.
@shloim, the upcoming Log4j 2 release (i.e., 2.23.0) will be the very first Log4j 2 version that ships Disruptor 4 support. (Disruptor 3 will still work too.) Is that something you can try?
Alternatively, can you upgrade your Log4j version to 2.18.0 or later?
In 2.18.0 @remkop backported the TimeoutBlockingWaitStrategy from Disruptor 4.x with the following comment:
https://github.com/apache/logging-log4j2/blob/252f1a65e1b44fa78c622b4140d9116bf57d3617/log4j-core/src/main/java/org/apache/logging/log4j/core/async/TimeoutBlockingWaitStrategy.java#L34-L40
The difference in the locking mechanism seems relevant to your case.
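A quick way to confirm which Disruptor build actually ends up on the runtime classpath after such an upgrade is a classpath check like the following minimal sketch (RingBuffer is a real Disruptor class; the printed labels are illustrative, and getImplementationVersion() may return null if the jar manifest omits the attribute):

public class DisruptorVersionCheck {
    public static void main(String[] args) {
        Class<?> ringBuffer = com.lmax.disruptor.RingBuffer.class;
        Package pkg = ringBuffer.getPackage();
        System.out.println("Disruptor version: "
                + (pkg != null ? pkg.getImplementationVersion() : "unknown"));
        // CodeSource can be null for bootstrap-loaded classes; guard accordingly.
        java.security.CodeSource source =
                ringBuffer.getProtectionDomain().getCodeSource();
        System.out.println("Loaded from: "
                + (source != null ? source.getLocation() : "unknown"));
    }
}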
Do I need to set a custom WaitStrategy or can I keep the default behavior?
@shloim,
"Our" TimeoutBlockingWaitStrategy should be the default one.
After switching to Log4j 2.18.0 the issue persists, but with a different stack trace:
"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #20 daemon prio=5 os_prio=0 cpu=3228328.37ms elapsed=3495.56s tid=0x00007fea233e9000 nid=0xd0 runnable [0x00007fea23dfd000]
java.lang.Thread.State: RUNNABLE
at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)
@shloim, can you give 2.23.0-SNAPSHOT a try in combination with Disruptor 4? (See docs on how to configure Maven/Gradle/SBT for snapshot builds.)
Hi, I'm seeing this with 2.23.1 and Disruptor 4.0.0 in my application.
It's happening when I use:
Log4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
AsyncLogger.WaitStrategy=Yield
"Log4j2-TF-1-AsyncLogger[AsyncContext@3b192d32]-1" - Thread t@14
java.lang.Thread.State: RUNNABLE
at java.base@11/java.lang.Thread.yield(Native Method)
at app//com.lmax.disruptor.YieldingWaitStrategy.applyWaitMethod(YieldingWaitStrategy.java:58)
at app//com.lmax.disruptor.YieldingWaitStrategy.waitFor(YieldingWaitStrategy.java:40)
at app//com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
at app//com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:156)
at app//com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122)
at java.base@11/java.lang.Thread.run(Thread.java:829)
Locked ownable synchronizers:
- None
Operating system: MacOS 14.4.1
JDK: openjdk:11-jre
Overall, I'm thinking this may be by design of the YieldingWaitStrategy, whose documentation states:
This strategy will use 100% CPU, but will more readily give up the CPU than a busy spin strategy if other threads require CPU resource.
Perhaps the Log4j async manual could be more clear. It currently states:
Yield is a strategy that uses a Thread.yield() for waiting for log events after an initially spinning. Yield is a good compromise between performance and CPU resource, but may use more CPU than Sleep in order to get the message logged to disk sooner.
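To illustrate why this reads as by-design behaviour, here is a simplified sketch of the kind of spin-then-yield wait loop the stack trace shows (Thread.yield() inside YieldingWaitStrategy.applyWaitMethod); it is not the actual Disruptor source, and the names and spin count are made up:

import java.util.concurrent.atomic.AtomicLong;

final class YieldingWaitSketch {
    private static final int SPIN_TRIES = 100; // illustrative bound

    // Busy-waits until the published cursor reaches the requested sequence.
    // Whether spinning or yielding, the thread never blocks, so a core stays
    // at 100% for as long as no new events are published.
    static long waitFor(long requestedSequence, AtomicLong cursor) {
        int counter = SPIN_TRIES;
        long available;
        while ((available = cursor.get()) < requestedSequence) {
            if (counter > 0) {
                counter--;          // first: bounded spin
            } else {
                Thread.yield();     // then: yield, but keep polling
            }
        }
        return available;
    }
}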
Perhaps the Log4j async manual could be more clear
We can indeed do this. Updated the #2534 description accordingly.
@shloim, any updates? Were you able to reproduce the issue with Log4j 2.23.1 using Disruptor 4?
The documentation of Yield was fixed in #2577.