
OOM under no load (dump added)

Open febael opened this issue 7 years ago • 7 comments

Hello Ladies and Gentlemen,

We are getting OOM errors on the server: dump file

  • We get such errors unexpectedly, at random times. They have occurred less frequently since our switch from version 7.x to 8, but we have still had them several times, over periods of several days, under no load.
  • You don't need to do anything special, just wait with 2 idle local nodes connected to the server. We do run a few jobs, but it doesn't look related to that.
  • We run version 8.0.0 on Ubuntu, in a Docker container.
  • You can check the dump file. What is interesting is a 1.1 GB byte array received over the network, with the OOM occurring while trying to create a new byte array to store it. Delving into the array shows that it starts with "/ HTTP/1.0 \13\10\13\10" followed by more than a billion zero bytes (a sketch of how such a frame could arise follows the stack trace below). The call stack of the thread at the time of the OOM is:

"New I/O worker #2" prio=5 tid=16 RUNNABLE at java.lang.OutOfMemoryError.(OutOfMemoryError.java:48) at org.jboss.netty.buffer.HeapChannelBuffer.(HeapChannelBuffer.java:42) at org.jboss.netty.buffer.BigEndianHeapChannelBuffer.(BigEndianHeapChannelBuffer.java:34) at org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134) at org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68) at org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48) at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.getBuffer(PNPServerFrameDecoder.java:194) at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.decode(PNPServerFrameDecoder.java:138) at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.callDecode(PNPServerFrameDecoder.java:104) Local Variable: java.net.InetSocketAddress#157 at org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder.handleUpstream(PNPServerFrameDecoder.java:86) Local Variable: org.objectweb.proactive.extensions.pnp.PNPServerFrameDecoder#3 Local Variable: org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext#20 Local Variable: org.jboss.netty.buffer.BigEndianHeapChannelBuffer#6 at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) Local Variable: org.jboss.netty.channel.DefaultChannelPipeline#6 Local Variable: org.jboss.netty.channel.UpstreamMessageEvent#1 at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) Local Variable: org.jboss.netty.channel.socket.nio.NioAcceptedSocketChannel#3 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) Local Variable: sun.nio.ch.SelectionKeyImpl#40 Local Variable: java.util.HashMap$KeyIterator#1 at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) Local Variable: sun.nio.ch.EPollSelectorImpl#4 at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) Local Variable: org.jboss.netty.channel.socket.nio.NioWorker#3 at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) Local Variable: java.lang.String#8438 Local Variable: java.lang.String#8439 Local Variable: org.jboss.netty.util.ThreadRenamingRunnable#5 at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) Local Variable: org.jboss.netty.util.internal.DeadLockProofWorker$1#4 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) Local Variable: java.util.concurrent.ThreadPoolExecutor$Worker#5 at java.lang.Thread.run(Thread.java:748)

febael Aug 02 '18 16:08

@febael Thank you very much for reporting this issue.

Your finding is very interesting. That message received on the network looks faulty, containing basically zeros.

How much maximum heap memory do you give the Java virtual machine when you start it?
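If you are not sure what limit is in effect, here is a minimal sketch (generic Java, not specific to the scheduler) that prints it from inside a JVM:

    public class MaxHeapCheck {
        public static void main(String[] args) {
            // Maximum heap the JVM will use: the -Xmx value if one was given,
            // otherwise the JVM's own default.
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + maxBytes / (1024 * 1024) + " MiB");
        }
    }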

@fviale @luispinedamx What do you think about this memory issue?

tobwiens Aug 03 '18 07:08

@febael Sadly I cannot analyze your dump file. I downloaded it twice but I get the following error:

    java.io.IOException: Unrecognized magic number: 529205248
        at com.sun.tools.hat.internal.parser.Reader.readFile(Reader.java:94)
        at com.sun.tools.hat.Main.main(Main.java:159)

tobwiens Aug 03 '18 07:08

Hi @tobwiens,

There's no setting regarding the min/max memory.

The original huge dump file has the expected magic number:

    xxd java_pid6.hprof | head -1
    00000000: 4a41 5641 2050 524f 4649 4c45 2031 2e30  JAVA PROFILE 1.0

Running the same check on the tar.gz file instead, I get:

    xxd dump.tar.gz | head -1
    00000000: 1f8b 0800 37e4 625b 0003 ec7c 7974 1377  ....7.b[...|yt.w

    echo $(( 16#1f8b0800 ))
    529205248

So I assume you tried to directly open the tar.gz file :)
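In case it helps, a small sketch that distinguishes a gzip archive from a raw hprof dump by its first bytes (the file name is just an example):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MagicCheck {
        public static void main(String[] args) throws IOException {
            // Example file name; point it at the actual download.
            try (InputStream in = Files.newInputStream(Paths.get("dump.tar.gz"))) {
                byte[] header = new byte[16];
                int read = in.read(header);
                if (read >= 2 && (header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b) {
                    // 0x1f8b is the gzip magic; 0x1f8b0800 is the 529205248
                    // reported by jhat as an unrecognized magic number.
                    System.out.println("gzip archive - extract it first");
                } else if (read >= 12 && "JAVA PROFILE".equals(new String(header, 0, 12, StandardCharsets.US_ASCII))) {
                    System.out.println("raw hprof heap dump");
                } else {
                    System.out.println("something else");
                }
            }
        }
    }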

febael Aug 03 '18 08:08

Hello @febael,

I analysed the heap dump you provided using Eclipse Memory Analyser and it does not show anything special. Like you said, the scheduler seems to be quite empty. No object occupies an abnormal amount of memory (the biggest retainer being the JVM class loader, but nothing alarming).

What you describe regarding the 1 GB PNP message is quite surprising, though. A few questions below:

  • How did you manage to decode the network message? Especially since this message's huge size triggered the out-of-memory error, how did you manage to analyse it retroactively?
  • When this OOM appeared, were you doing strictly nothing at the scheduler level? Was there no task running? No file transfer in progress?
  • Do you see any unusual errors in the scheduler logs (scheduler_home/logs/Scheduler.log)?
  • Could you send us the full thread dump of the JVM at the moment of the OOM?

fviale Aug 03 '18 11:08

Thank you @fviale for your analysis. I used the jvisualvm tool (it should ship with every JDK installation), and I could get all thread dumps and objects in the heap with it.

  • How did you manage to decode the network message? Especially since this message's huge size triggered the out-of-memory error, how did you manage to analyse it retroactively?
    • The thread causing the OOM shows the allocation of a byte array in a constructor. In the biggest-objects list, a byte array is the clear winner at 1.1 GB. You can easily see the contents of this byte array with jvisualvm. It is wrapped by many layers and resides under a DefaultChannelPipeline object, as a value in a HashMap whose key is "pnpDecoder".
  • When this OOM appeared, were you doing strictly nothing at the scheduler level? Was there no task running? No file transfer in progress?
    • I can say with 99.9% certainty that nothing was being done at the time of the error.
  • Do you see any unusual errors in the scheduler logs (scheduler_home/logs/Scheduler.log)?
    • Nothing unusual in Scheduler.log or Node-local-LocalNodes-0.log.
  • Could you send us the full thread dump of the JVM at the moment of the OOM?
    • I think jvisualvm can help for this purpose; a programmatic alternative is sketched below.
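For reference, a minimal sketch of grabbing a thread dump programmatically (generic Java, an alternative to jvisualvm):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;

    public class ThreadDumpSketch {
        public static void main(String[] args) {
            // Dump all live threads with locked monitors and synchronizers,
            // similar to what jstack or jvisualvm produce (note that
            // ThreadInfo.toString() truncates long stack traces).
            for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }
    }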

Do you have any ideas on how to monitor this case for more clues?

febael Aug 03 '18 12:08

Thank you for your detailed explanation; I indeed observe what you describe in JVisualVM.

Apparently, one frame decoded by PNPServerFrameDecoder (part of the low-level ProActive network protocol) is excessively big. The maximum frame size set in the code is simply MAXINT bytes (over 2 GB).
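To illustrate the point, here is a hypothetical sketch (not the actual PNPServerFrameDecoder code) of the kind of guard a length-prefixed decoder can apply before allocating:

    import java.io.DataInputStream;
    import java.io.IOException;

    // Hypothetical sketch, not the real ProActive code: validate the advertised
    // frame length before allocating a buffer of that size.
    public class FrameLengthGuard {
        // Illustrative cap; the current decoder effectively accepts lengths up
        // to Integer.MAX_VALUE.
        static final int MAX_FRAME_LENGTH = 16 * 1024 * 1024; // 16 MiB

        static byte[] readFrame(DataInputStream in) throws IOException {
            int length = in.readInt(); // assumed 4-byte big-endian length prefix
            if (length < 0 || length > MAX_FRAME_LENGTH) {
                // Fail fast instead of attempting a multi-gigabyte allocation.
                throw new IOException("Implausible frame length: " + length);
            }
            byte[] frame = new byte[length];
            in.readFully(frame);
            return frame;
        }
    }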

A low-level PNP message can be many things, and based on the content of the frame bytes alone, it is nearly impossible to find out the origin of the message.

What I suggest is that you change the log level of the PNP protocol to TRACE in scheduler_server/config/log/server.properties:

    log4j.logger.proactive.pnp = TRACE
    log4j.logger.proactive.remoteobject = DEBUG

This will be very verbose, but whenever the problem happens again, we might be able to find the provenance of the last message.

fviale Aug 03 '18 18:08

Thank you @fviale, I have done that and will report back with my findings.

febael Aug 06 '18 08:08