
[2.17.1] AsyncLogger thread is stuck on 100% CPU on lmax-3.4 queue polling

Open shloim opened this issue 2 years ago • 9 comments

Description

A low-probability race condition causes the LMAX Disruptor consumer thread to loop indefinitely on Thread.yield.

Configuration

-DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.enable.threadlocals=true -Dlog4j2.enable.direct.encoders=true -XX:CompileCommand=dontinline,org.apache.logging.log4j.core.async.perftest.NoOpIdleStrategy::idle

Version: 2.17.1

Operating system: Centos 8

JDK: OpenJDK 64-Bit Server VM GraalVM LIBGRAAL 20.3.3 (build 11.0.12+5-jvmci-20.3-b06)

Logs

It would seem that the queue data structure becomes corrupted and causes the polling thread to loop infinitely with the following stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #18 daemon prio=5 os_prio=0 cpu=4404755.82ms elapsed=4733.22s tid=0x00007fed7ade9800 nid=0xcd runnable  [0x00007fed7bffd000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Thread.yield(java.base@11.0.12/Native Method)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.transferAfterCancelledWait(java.base@11.0.12/AbstractQueuedSynchronizer.java:1752)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.12/AbstractQueuedSynchronizer.java:2119)
        at com.lmax.disruptor.TimeoutBlockingWaitStrategy.waitFor(TimeoutBlockingWaitStrategy.java:38)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)

Reproduction

This was found on a large production system with very low probability, using Log4j with the above settings. We don't know the exact conditions under which the polling thread reaches this state.

shloim avatar Jan 07 '24 14:01 shloim

@shloim, the upcoming Log4j 2 release (i.e., 2.23.0) will be the very first Log4j 2 version that ships Disruptor 4 support. (Disruptor 3 will still work too.) Is that something you can try?

vy avatar Jan 08 '24 08:01 vy
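
For reference, trying that combination should only require bumping the dependency coordinates once 2.23.0 is out; the versions below are taken from later in this thread (Log4j 2.23.0, Disruptor 4.0.0) and are given as an illustration, not as maintainer guidance:

    org.apache.logging.log4j:log4j-core:2.23.0
    com.lmax:disruptor:4.0.0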

Alternatively, can you upgrade your Log4j version to 2.18.0 or later?

In 2.18.0 @remkop backported the TimeoutBlockingWaitStrategy from Disruptor 4.x with the following comment:

https://github.com/apache/logging-log4j2/blob/252f1a65e1b44fa78c622b4140d9116bf57d3617/log4j-core/src/main/java/org/apache/logging/log4j/core/async/TimeoutBlockingWaitStrategy.java#L34-L40

The difference in the locking mechanism seems relevant to your case.

ppkarwasz avatar Jan 08 '24 08:01 ppkarwasz

Do I need to set a custom WaitStrategy or can I keep the default behavior?

shloim avatar Jan 08 '24 13:01 shloim

@shloim,

"Our" TimeoutBlockingWaitStrategy should be the default one.

ppkarwasz avatar Jan 08 '24 14:01 ppkarwasz
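
In other words, no extra configuration should be needed. If you do want to pin the strategy explicitly, the async logger wait strategy can be selected with the same property used later in this thread (the documented values include Block, Timeout, Sleep and Yield); a minimal example:

    -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
    -DAsyncLogger.WaitStrategy=Timeout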

After switching to Log4j 2.18.0, the issue persists but with a different stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #20 daemon prio=5 os_prio=0 cpu=3228328.37ms elapsed=3495.56s tid=0x00007fea233e9000 nid=0xd0 runnable  [0x00007fea23dfd000]
   java.lang.Thread.State: RUNNABLE
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)

shloim avatar Jan 14 '24 12:01 shloim

@shloim, can you give 2.23.0-SNAPSHOT a try in combination with Disruptor 4? (See docs on how to configure Maven/Gradle/SBT for snapshot builds.)

vy avatar Jan 14 '24 19:01 vy
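
For anyone wanting to try the snapshot: Apache snapshot artifacts are not published to Maven Central, so the build needs the ASF snapshot repository added. A minimal sketch, assuming the usual ASF Nexus snapshot URL:

    Repository URL: https://repository.apache.org/content/repositories/snapshots/
    Dependency:     org.apache.logging.log4j:log4j-core:2.23.0-SNAPSHOT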

Hi, I'm seeing this with 2.23.1 and Disruptor 4.0.0 in my application.

It's happening when I use:

Log4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
AsyncLogger.WaitStrategy=Yield

"Log4j2-TF-1-AsyncLogger[AsyncContext@3b192d32]-1" - Thread t@14
   java.lang.Thread.State: RUNNABLE
	at java.base@11/java.lang.Thread.yield(Native Method)
	at app//com.lmax.disruptor.YieldingWaitStrategy.applyWaitMethod(YieldingWaitStrategy.java:58)
	at app//com.lmax.disruptor.YieldingWaitStrategy.waitFor(YieldingWaitStrategy.java:40)
	at app//com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
	at app//com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:156)
	at app//com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122)
	at java.base@11/java.lang.Thread.run(Thread.java:829)

   Locked ownable synchronizers:
	- None

Operating system: macOS 14.4.1

JDK: openjdk:11-jre

Overall, I'm thinking this may be by design of the YieldingWaitStrategy, which states:

This strategy will use 100% CPU, but will more readily give up the CPU than a busy spin strategy if other threads require CPU resource.

Perhaps the Log4j async manual could be clearer; it currently states:

Yield is a strategy that uses a Thread.yield() for waiting for log events after an initially spinning. Yield is a good compromise between performance and CPU resource, but may use more CPU than Sleep in order to get the message logged to disk sooner.

AndreKurait avatar May 04 '24 04:05 AndreKurait
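
For readers unfamiliar with the strategy, the behavior described above follows from the spin-then-yield pattern: the consumer never parks, it just keeps re-checking the published sequence and calls Thread.yield() once an initial spin budget is exhausted. A minimal, self-contained sketch of that pattern (illustrative only; class name and spin threshold are made up, and this is not the Disruptor source):

import java.util.concurrent.atomic.AtomicLong;

// Sketch of a spin-then-yield wait loop, showing why a yield-based strategy
// keeps one core busy even when no events arrive.
public class YieldWaitSketch {

    private static final int SPIN_TRIES = 100; // hypothetical spin budget

    // Burns CPU until the published sequence reaches 'expected'. The thread
    // stays RUNNABLE the whole time, which matches the dumps in this issue.
    static long waitFor(long expected, AtomicLong publishedSequence) {
        int counter = SPIN_TRIES;
        long available;
        while ((available = publishedSequence.get()) < expected) {
            if (counter > 0) {
                counter--;          // phase 1: busy-spin
            } else {
                Thread.yield();     // phase 2: yield, but never park or sleep
            }
        }
        return available;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong published = new AtomicLong(-1);

        Thread consumer = new Thread(
                () -> System.out.println("consumed sequence " + waitFor(0, published)),
                "sketch-consumer");
        consumer.start();

        Thread.sleep(1000);   // consumer spins/yields at ~100% CPU here
        published.set(0);     // "publish" one event; the consumer exits
        consumer.join();
    }
}

With a Timeout or Sleep strategy the same wait eventually parks the thread, which is why those strategies trade a little latency for much lower idle CPU.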

Perhaps the Log4j async manual could be clearer

We can indeed do this. Updated the #2534 description accordingly.

vy avatar May 06 '24 09:05 vy

@shloim, any updates? Were you able to reproduce the issue with Log4j 2.23.1 using Disruptor 4?

vy avatar May 06 '24 09:05 vy

The documentation of Yield was fixed in #2577.

ppkarwasz avatar May 21 '24 20:05 ppkarwasz