ArrayIndexOutOfBoundsException on attempt to reconnect
Defect
We observed one of our NATS clients in our production environment flooding the log with the following exceptions.
- For the first 40 minutes it threw the following exception during the publish() call (message consumption was healthy):
java.lang.IllegalStateException: Output queue is full 5000
at io.nats.client.impl.MessageQueue.push(MessageQueue.java:124)
at io.nats.client.impl.MessageQueue.push(MessageQueue.java:110)
at io.nats.client.impl.NatsConnectionWriter.queue(NatsConnectionWriter.java:209)
at io.nats.client.impl.NatsConnection.queueOutgoing(NatsConnection.java:1353)
at io.nats.client.impl.NatsConnection.publishInternal(NatsConnection.java:765)
at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:733)
- Then both publishing and consumption stopped working for another 40 minutes, until the service was terminated. During that time the log was flooded with the following exception (each time with the same last destination index and byte index):
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:380)
at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:583)
at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:541)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
at java.base/java.lang.System.arraycopy(Native Method)
at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:147)
at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:188)
... 5 common frames omitted
Our client configuration:
Options options = new Options.Builder()
.servers(natsBootstrapServers.addresses)
.errorListener(errorListener)
.connectionTimeout(Duration.ofSeconds(5))
.maxReconnects(-1) // -1 = Infinite
.connectionName(getHostName())
.pingInterval(Duration.ofSeconds(2))
.build();
We don't have a reproduction scenario yet.
Versions of io.nats:jnats and nats-server:
- nats-server v2.7.4
- jnats v2.14.1
OS/Container environment:
adoptopenjdk/openjdk11:jdk-11.0.8_10-alpine
Linux kernel version 5.4.170
@igor-komar You say "message consumption was healthy", what does this mean? If the output queue is full, it seems that it's having difficulty sending those messages to the server in a timely fashion. How big are your message payloads?
@scottf "message consumption was healthy" means that message receiving was NOT interrupted on the service; only publishing didn't work. Incoming/outgoing messages have an average size of ~1 KB. Max size is not metered; I think it's below 50 KB.
Is the consumption on the same connection as the publishing? I'm trying to understand why you mentioned consumption. I recognize that something bad is happening when publishing stops because of a full output queue, but I'm really trying to understand how the queue got full in the first place.
Yes, consumption is on the same connection as the publishing.
but am really trying to understand how the queue got full in the first place.
It was a problem with exactly this one affected instance. Other instances of the same service were consuming and publishing without problems. I guess this might be the result of a race condition.
I also encountered this exact issue. I did not manage to capture a stack trace; somehow it got swallowed, not sure why. First, we get the queue-full error:
17-Jun-2022 03:35:39.473 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.io.IOException: java.lang.IllegalStateException: Output queue is full 5000
And right after that:
17-Jun-2022 03:35:39.489 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65654 out of bounds for byte[65536]
And this last error seems to be repeated continuously, until I restart the application.
Relevant version numbers:
- jnats 2.14.0
- server 2.8.0
Edit: also relevant, of course: the client was publishing to an on-disk JetStream stream, with this config:
val config = StreamConfiguration.builder()
.name(stream)
.subjects(subjects)
.retentionPolicy(RetentionPolicy.Limits)
.maxAge(maxAge)
.storageType(StorageType.File)
.replicas(3)
.build()
Publishing was done using publishAsync.
@scottf, I have been looking into the code and I believe I've found what goes wrong in NatsConnectionWriter.sendMessageBatch (related to the ArrayIndexOutOfBoundsException):
- The exception gets thrown when the `sendBuffer` has a smaller size, so the current message doesn't fit.
- In the logic for resizing the `sendBuffer`, there is a check if it's the first item; if so, the buffer is re-sized. Otherwise the buffer is sent, and we reset the `sendPosition`.
However, I believe two things go wrong for the following piece of code in NatsConnectionWriter.sendMessageBatch:
while (msg != null) {
long size = msg.getSizeInBytes();
if (sendPosition + size > sendBuffer.length) {
if (sendPosition == 0) { // have to resize
this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
} else { // else send and go to next message
dataPort.write(sendBuffer, sendPosition);
connection.getNatsStatistics().registerWrite(sendPosition);
sendPosition = 0;
msg = msg.next;
if (msg == null) {
break;
}
}
}
// ... use System.arraycopy, etc.
msg = msg.next;
}
- If `msg == null` we break out of the loop (which is correct), but if there is a message then the resizing of the `sendBuffer` isn't checked. That will cause the `System.arraycopy(...)` to throw an error, since the buffer can't accommodate the size of the new message. Changing this into a `continue` instead of the `if + break` would work; then the resizing check would happen for this message and this should be fixed.
Something that I find a bit curious though is why the `msg = msg.next;` is happening. Let me explain my train of thought:
- There is a `msg` and we get its size.
- At this point `sendPosition != 0` and the buffer doesn't have enough space.
- We can't resize the buffer currently, since there is still data in there. So we write the buffer and record the stats.
- The `sendPosition` is reset to 0, and we can now re-use the buffer.
But why does the `msg = msg.next` happen? I believe we wouldn't send the current message, since it isn't contained in the buffer that was just sent.
Using a `continue` and not overwriting the `msg` with the next one should fix both of these issues. Do you agree with this, or am I missing something?
while (msg != null) {
    long size = msg.getSizeInBytes();
    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // send what we have, then re-check this same message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;
            // go back to the start of the loop, to ensure we check for resizing the buffer again
            continue;
        }
    }
    // ... use System.arraycopy, etc.
    msg = msg.next;
}
Created a PR for this in advance :slightly_smiling_face:
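To make the dropped-message scenario above concrete, here is a self-contained simulation of the batching loop with the proposed `continue` fix applied. The `Msg` class and the list of flushed write sizes are stand-ins for the real jnats `NatsMessage` and `DataPort`, purely for illustration:

```java
// Simulation of the batching loop described above, with the `continue` fix.
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
    static class Msg {
        final byte[] bytes;
        Msg next;
        Msg(int size) { bytes = new byte[size]; }
        long getSizeInBytes() { return bytes.length; }
    }

    // Copies each message into sendBuffer, recording a flush in `writes` when
    // the buffer fills. Returns the total bytes written; with the fix this
    // always equals the sum of all message sizes (nothing is skipped).
    static long sendBatch(Msg msg, int initialBufferSize, List<Integer> writes) {
        byte[] sendBuffer = new byte[initialBufferSize];
        int sendPosition = 0;
        long written = 0;
        while (msg != null) {
            long size = msg.getSizeInBytes();
            if (sendPosition + size > sendBuffer.length) {
                if (sendPosition == 0) { // buffer empty: safe to resize
                    sendBuffer = new byte[(int) Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
                } else { // flush, then re-check the SAME message
                    writes.add(sendPosition);
                    written += sendPosition;
                    sendPosition = 0;
                    continue; // the fix: do not advance msg here
                }
            }
            System.arraycopy(msg.bytes, 0, sendBuffer, sendPosition, (int) size);
            sendPosition += (int) size;
            msg = msg.next;
        }
        if (sendPosition > 0) { // flush the remainder
            writes.add(sendPosition);
            written += sendPosition;
        }
        return written;
    }

    public static void main(String[] args) {
        // Two 40-byte messages into a 64-byte buffer: the second doesn't fit
        // after the first, so the buffer is flushed and the second is retried.
        Msg first = new Msg(40);
        first.next = new Msg(40);
        long total = sendBatch(first, 64, new ArrayList<>());
        System.out.println(total); // 80: no message lost
    }
}
```

With the original `msg = msg.next` branch instead of `continue`, the second message would be skipped entirely, and a later oversized message could reach `System.arraycopy` with a buffer that was never resized.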
Ok, I think I see it now. We advance the msg to the next, but never send the one that would have been too big for the buffer. It probably doesn't happen much because of how often we run this code and because the default buffer is plenty big, but in cases where the buffer was small or the message was really big, this can happen a lot.
Yeah, I believe changing this into the continue would solve this, by ensuring the re-size is checked again and that one message isn't lost.
@igor-komar @michelwilson There is a fix in the main branch, and it should be available in 2.16.1-SNAPSHOT on Sonatype.
Small update from my end: I have not been able to reproduce this error reliably (if at all) on our setup. It seems like it takes a really long time to get into this state, or it's just "luck". However, we will update our jnats version to 2.16.1-SNAPSHOT soon; the issue should be fixed then.
Another update: we updated to version 2.16.1 yesterday and are sadly still encountering this issue.
11-Oct-2022 18:07:46.413 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.415 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.416 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.417 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.418 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.420 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.420 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.422 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
Will add the errorListener option, so we can inspect the full stacktrace.
I wonder if I should just switch to using the ByteArrayBuilder, which automatically expands and uses ByteBuffer under the covers.
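Such an auto-expanding buffer might look roughly like the sketch below. This is a hypothetical illustration using `ByteBuffer`, not the actual jnats `ByteArrayBuilder` implementation:

```java
// Hypothetical auto-expanding buffer in the spirit of the suggestion above.
import java.nio.ByteBuffer;

public class GrowableBuffer {
    private ByteBuffer buffer;

    public GrowableBuffer(int initialCapacity) {
        buffer = ByteBuffer.allocate(initialCapacity);
    }

    public void append(byte[] bytes) {
        if (buffer.remaining() < bytes.length) {
            // Grow to at least fit the new bytes, doubling to amortize copies.
            int needed = buffer.position() + bytes.length;
            int newCapacity = Math.max(needed, buffer.capacity() * 2);
            ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
            buffer.flip();          // prepare existing content for reading
            bigger.put(buffer);     // copy it into the larger buffer
            buffer = bigger;
        }
        buffer.put(bytes);
    }

    public int length() {
        return buffer.position();
    }

    public static void main(String[] args) {
        GrowableBuffer g = new GrowableBuffer(4);
        g.append("hello world".getBytes()); // 11 bytes, forces a resize
        System.out.println(g.length()); // 11
    }
}
```

An expand-on-demand buffer like this removes the "flush or resize" branching entirely, which is exactly the branch where the skipped-message bug lived.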
@scottf Got a full stacktrace this time:
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
at java.base/java.util.concurrent.FutureTask.report(Unknown Source)
at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:381)
at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:584)
at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:542)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
at java.base/java.lang.System.arraycopy(Native Method)
at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:145)
at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:186)
... 5 more
However, before this it seems like messages can't be sent at all anymore:
java.util.concurrent.CompletionException: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at io.nats.client.support.NatsRequestCompletableFuture.cancelTimedOut(NatsRequestCompletableFuture.java:31)
at io.nats.client.impl.NatsConnection.lambda$cleanResponses$6(NatsConnection.java:957)
at java.base/java.util.concurrent.ConcurrentHashMap.forEach(Unknown Source)
at io.nats.client.impl.NatsConnection.cleanResponses(NatsConnection.java:953)
at io.nats.client.impl.NatsConnection$2.run(NatsConnection.java:455)
at java.base/java.util.TimerThread.mainLoop(Unknown Source)
at java.base/java.util.TimerThread.run(Unknown Source)
Caused by: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
... 7 common frames omitted
Not quite sure what the root cause is. It seems like something is going wrong even prior to the out-of-bounds error.
What can I do to reproduce this?
Not sure how to reproduce it yet. Currently trying to dig through the logs to get more clarity.
Looking a bit more in the logs for a timeline:
- UTC 19:36:57, messages start piling up
- UTC 19:37:16, first sign of a `CancellationException`
- UTC 19:43:11, first sign of an `ArrayIndexOutOfBoundsException`
Our way of "reproducing" this so far was running it for a day or two until it eventually goes into this state, but that isn't really a reliable way :sweat:
@igor-komar @michelwilson This issue has been fixed in PR https://github.com/nats-io/nats.java/pull/820 and released in 2.16.6. Thanks to @MauriceVanVeen. The core issue was that the lazy message size calculation was not thread-safe. Lazy calculation was replaced with calculation upon construction.
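To illustrate the class of bug described in that fix (the class and field names below are hypothetical, not the actual jnats code): a size computed lazily in an unsynchronized getter can be observed inconsistently across threads, whereas a size computed once in the constructor is a final field and is safely published.

```java
// Hypothetical illustration of the fix: LazyMessage computes its size in an
// unsynchronized getter, so the batch could be sized with one value while
// arraycopy effectively sees another; EagerMessage computes it once in the
// constructor as a final field. The "+ 16" protocol overhead is an assumed
// placeholder, not the real jnats wire-size formula.
public class MessageSize {
    static class LazyMessage {
        private final byte[] payload;
        private long sizeInBytes = -1; // not volatile, not synchronized

        LazyMessage(byte[] payload) { this.payload = payload; }

        long getSizeInBytes() {
            if (sizeInBytes < 0) {       // another thread may still observe
                sizeInBytes = compute(); // the default or a stale value
            }
            return sizeInBytes;
        }

        private long compute() { return payload.length + 16; }
    }

    static class EagerMessage {
        private final byte[] payload;
        private final long sizeInBytes; // final field: safe publication

        EagerMessage(byte[] payload) {
            this.payload = payload;
            this.sizeInBytes = payload.length + 16;
        }

        long getSizeInBytes() { return sizeInBytes; }
    }

    public static void main(String[] args) {
        System.out.println(new EagerMessage(new byte[10]).getSizeInBytes()); // 26
    }
}
```

Under the Java Memory Model, final fields written in a constructor are visible to any thread that sees the object, which is why moving the calculation to construction removes the race without needing locks.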