nats.java

ArrayIndexOutOfBoundsException on attempt to reconnect

Open igor-komar opened this issue 3 years ago • 12 comments

Defect

We observed one of our NATS clients in our production environment flooding its log with the following exceptions.

  1. For the first 40 minutes it threw the following exception during publish() calls (message consumption was healthy):
java.lang.IllegalStateException: Output queue is full 5000
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:124)
	at io.nats.client.impl.MessageQueue.push(MessageQueue.java:110)
	at io.nats.client.impl.NatsConnectionWriter.queue(NatsConnectionWriter.java:209)
	at io.nats.client.impl.NatsConnection.queueOutgoing(NatsConnection.java:1353)
	at io.nats.client.impl.NatsConnection.publishInternal(NatsConnection.java:765)
	at io.nats.client.impl.NatsConnection.publish(NatsConnection.java:733)
  2. Then neither publishing nor consumption worked for another 40 minutes, until the service was terminated. During this time the log was flooded with the following exceptions (each time with the same last destination index and byte array size):
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:380)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:583)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:541)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 175759 out of bounds for byte[131555]
	at java.base/java.lang.System.arraycopy(Native Method)
	at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:147)
	at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:188)
	... 5 common frames omitted

Our client configuration:

        Options options = new Options.Builder()
                .servers(natsBootstrapServers.addresses)
                .errorListener(errorListener)
                .connectionTimeout(Duration.ofSeconds(5))
                .maxReconnects(-1) // -1 = Infinite
                .connectionName(getHostName())
                .pingInterval(Duration.ofSeconds(2))
                .build();

We don't have a reproduction scenario yet.

Versions of io.nats:jnats and nats-server:

nats-server v2.7.4, jnats v2.14.1

OS/Container environment:

adoptopenjdk/openjdk11:jdk-11.0.8_10-alpine


igor-komar avatar May 16 '22 08:05 igor-komar

Linux kernel version 5.4.170

igor-komar avatar May 16 '22 08:05 igor-komar

@igor-komar You say "message consumption was healthy", what does this mean? If the output queue is full, it seems that it's having difficulty sending those messages to the server in a timely fashion. How big are your message payloads?

scottf avatar Jun 08 '22 14:06 scottf

@scottf "message consumption was healthy" means that messages receiving was NOT interrupted on the service, only publishing didn't work. Incoming/outgoing messages have average size of ~1KB. Max size is not metered, I think it's below 50KB.

igor-komar avatar Jun 08 '22 15:06 igor-komar

Is the consumption happening on the same connection as the publishing? I'm trying to understand why you noted consumption. I recognize that something bad is happening when publish is stopped because of a full output queue, but am really trying to understand how the queue got full in the first place.

scottf avatar Jun 08 '22 16:06 scottf

Yes, consumption is on the same connection as the publishing.

but am really trying to understand how the queue got full in the first place.

The problem was specific to this affected instance; other instances of the same service were consuming and publishing without problems. I guess this might be the result of a race condition.

igor-komar avatar Jun 08 '22 17:06 igor-komar

I also encountered this exact issue. I did not manage to capture a stack trace; somehow it got swallowed, not sure why. First, we get the queue full error:

17-Jun-2022 03:35:39.473 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.io.IOException: java.lang.IllegalStateException: Output queue is full 5000

And right after that:

17-Jun-2022 03:35:39.489 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 23, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65654 out of bounds for byte[65536]

And this last error seems to be repeated continuously, until I restart the application.

Relevant version numbers:

  • jnats 2.14.0
  • server 2.8.0

Edit: also relevant, of course: the client was publishing to an on-disk JetStream stream, with this config:

val config = StreamConfiguration.builder()
            .name(stream)
            .subjects(subjects)
            .retentionPolicy(RetentionPolicy.Limits)
            .maxAge(maxAge)
            .storageType(StorageType.File)
            .replicas(3)
            .build()

Publishing was done using publishAsync.
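For reference, a minimal sketch of that publish path (the subject and payload here are placeholders, not our actual values):

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.api.PublishAck;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

class PublishSketch {
    static void publishExample(Connection nc) throws IOException {
        JetStream js = nc.jetStream();
        // publishAsync returns immediately; the future completes when the
        // server acks the message, or exceptionally on failure/timeout
        CompletableFuture<PublishAck> ack =
                js.publishAsync("some.subject", "payload".getBytes(StandardCharsets.UTF_8));
        ack.whenComplete((pa, ex) -> {
            if (ex != null) {
                // handle a failed or cancelled publish here
            }
        });
    }
}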

michelwilson avatar Jun 17 '22 06:06 michelwilson

@scottf, I have been looking into the code and I believe I've found what goes wrong in NatsConnectionWriter.sendMessageBatch (related to the ArrayIndexOutOfBoundsException):

  • The exception gets thrown when the sendBuffer is too small, so the current message doesn't fit.
  • In the logic for resizing the sendBuffer, there is a check whether it's the first item; if so, the buffer is resized. Otherwise the buffer is sent and we reset the sendPosition.

However, I believe two things go wrong for the following piece of code in NatsConnectionWriter.sendMessageBatch:

while (msg != null) {
    long size = msg.getSizeInBytes();
    
    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // else send and go to next message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;
            msg = msg.next;
    
            if (msg == null) {
                break;
            }
        }
    }
    
    // ... use System.arraycopy, etc.

    msg = msg.next;
}
  • If msg == null we break out of the loop (which is correct), but if there is a next message, the resizing of the sendBuffer isn't checked again. That will cause the System.arraycopy(...) to throw, since the buffer can't accommodate the size of the new message. Changing this into a continue instead of the if + break would work: the resizing check would then happen for this message too, and this should be fixed.

Something I find a bit curious, though, is why the msg = msg.next; happens at all. Let me explain my train of thought:

  • There is a msg and we get its size.
  • At this point sendPosition != 0 and the buffer doesn't have enough space left for the message.
  • We can't resize the buffer right now, since there is still data in it. So we write out the buffer and record the stats.
  • The sendPosition is reset to 0, and we can now re-use the buffer.

But why does msg = msg.next happen? I believe we would never send the current message, since it isn't contained in the buffer that was just sent.

Using a continue and not overwriting msg with the next message should fix both of these issues. Do you agree, or am I missing something? (A small simulation of the failure follows the snippet below.)

while (msg != null) {
    long size = msg.getSizeInBytes();

    if (sendPosition + size > sendBuffer.length) {
        if (sendPosition == 0) { // have to resize
            this.sendBuffer = new byte[(int)Math.max(sendBuffer.length + size, sendBuffer.length * 2L)];
        } else { // else send and go to next message
            dataPort.write(sendBuffer, sendPosition);
            connection.getNatsStatistics().registerWrite(sendPosition);
            sendPosition = 0;

            // go back to the start of the loop, to ensure we check for resizing the buffer again
            continue;
        }
    }

    // ... System.arraycopy into sendBuffer, advance sendPosition and msg as before ...
}
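To make this concrete, here is a minimal, self-contained simulation of the original control flow (simplified, not the actual jnats code). Message B forces a flush and message C is larger than the whole buffer, so B is silently skipped and the copy of C overruns the array:

public class SendLoopDemo {
    public static void main(String[] args) {
        byte[] sendBuffer = new byte[8];
        // A fits; B no longer fits after A; C is bigger than the whole buffer
        byte[][] msgs = { new byte[6], new byte[4], new byte[16] };
        int sendPosition = 0;
        int i = 0;
        try {
            while (i < msgs.length) {
                int size = msgs[i].length;
                if (sendPosition + size > sendBuffer.length) {
                    if (sendPosition == 0) { // have to resize
                        sendBuffer = new byte[Math.max(sendBuffer.length + size, sendBuffer.length * 2)];
                    } else { // "flush" the buffer, then advance
                        sendPosition = 0;
                        i++; // bug 1: B is never copied, so it is silently dropped
                        if (i == msgs.length) {
                            break;
                        }
                    }
                }
                // bug 2: the size check above is not re-run after advancing,
                // so C (16 bytes) is copied into the 8-byte buffer
                System.arraycopy(msgs[i], 0, sendBuffer, sendPosition, msgs[i].length);
                sendPosition += msgs[i].length;
                i++;
            }
            System.out.println("completed without error");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("reproduced: " + e);
        }
    }
}

Running it prints the reproduced ArrayIndexOutOfBoundsException instead of completing; with the continue fix applied, the size check runs again, the buffer is resized, and all three messages are copied.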

MauriceVanVeen avatar Sep 26 '22 08:09 MauriceVanVeen

Created a PR for this in advance :slightly_smiling_face:

MauriceVanVeen avatar Sep 26 '22 09:09 MauriceVanVeen

Ok, I think I see it now. We advance msg to the next one, but never send the one that would have been too big for the buffer. It probably doesn't happen much because of how often we run this code and because the default buffer is plenty big, but in cases where the buffer is small or the message is really big, this can happen a lot.

scottf avatar Sep 27 '22 20:09 scottf

Yeah, I believe changing this into the continue would solve this, by ensuring the re-size is checked again and that one message isn't lost.

MauriceVanVeen avatar Sep 27 '22 20:09 MauriceVanVeen

@igor-komar @michelwilson There is a fix in the main branch, and it should be available in 2.16.1-SNAPSHOT on Sonatype.

scottf avatar Sep 29 '22 16:09 scottf

Small update from my end. We have not been able to reproduce this error reliably (if at all) on our setup. It seems to take a really long time to get into this state, or it's just "luck". However, we will update our jnats version to 2.16.1-SNAPSHOT soon; the issue should be fixed then.

MauriceVanVeen avatar Oct 05 '22 15:10 MauriceVanVeen

Another update: we updated to version 2.16.1 yesterday and sadly are still encountering this issue.

11-Oct-2022 18:07:46.413 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
11-Oct-2022 18:07:46.422 ERROR [pool-2-thread-1] io.nats.client.impl.ErrorListenerLoggerImpl: exceptionOccurred, Connection: 119, Exception: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 66137 out of bounds for byte[65536]
(the same line repeats continuously, a few milliseconds apart)

We will add the errorListener option so we can inspect the full stack trace.
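For reference, a minimal sketch of such a listener (the actual logging mechanism is up to the application):

import io.nats.client.Connection;
import io.nats.client.Consumer;
import io.nats.client.ErrorListener;
import io.nats.client.Options;

class StackTraceListener implements ErrorListener {
    @Override
    public void errorOccurred(Connection conn, String error) {
        System.err.println("NATS error: " + error);
    }

    @Override
    public void exceptionOccurred(Connection conn, Exception exp) {
        // print the full stack trace instead of just the exception message
        exp.printStackTrace();
    }

    @Override
    public void slowConsumerDetected(Connection conn, Consumer consumer) {
        System.err.println("NATS slow consumer detected");
    }

    // wire it into the connection options:
    static Options optionsWithListener() {
        return new Options.Builder()
                .errorListener(new StackTraceListener())
                .build();
    }
}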

MauriceVanVeen avatar Oct 11 '22 18:10 MauriceVanVeen

I wonder if I should just switch to using the ByteArrayBuilder, which automatically expands and uses ByteBuffer under the covers.
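Roughly, the pattern that would buy us (a generic sketch of a self-growing buffer, not the actual ByteArrayBuilder API):

import java.nio.ByteBuffer;

class GrowableBuffer {
    private ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);

    void append(byte[] bytes) {
        if (buffer.remaining() < bytes.length) {
            // grow to at least double, or enough to hold this append
            int needed = buffer.position() + bytes.length;
            ByteBuffer bigger = ByteBuffer.allocate(Math.max(buffer.capacity() * 2, needed));
            buffer.flip();      // switch to read mode
            bigger.put(buffer); // copy existing bytes
            buffer = bigger;
        }
        // capacity was checked first, so this put can never overrun
        buffer.put(bytes);
    }
}

Because the capacity check always happens before the write, this class of out-of-bounds bug goes away by construction.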

scottf avatar Oct 11 '22 20:10 scottf

@scottf Got a full stacktrace this time:

java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
	at java.base/java.util.concurrent.FutureTask.report(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.get(Unknown Source)
	at io.nats.client.impl.NatsConnection.tryToConnect(NatsConnection.java:381)
	at io.nats.client.impl.NatsConnection.reconnect(NatsConnection.java:254)
	at io.nats.client.impl.NatsConnection.closeSocket(NatsConnection.java:584)
	at io.nats.client.impl.NatsConnection.lambda$handleCommunicationIssue$3(NatsConnection.java:542)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ArrayIndexOutOfBoundsException: arraycopy: last destination index 65643 out of bounds for byte[65536]
	at java.base/java.lang.System.arraycopy(Native Method)
	at io.nats.client.impl.NatsConnectionWriter.sendMessageBatch(NatsConnectionWriter.java:145)
	at io.nats.client.impl.NatsConnectionWriter.run(NatsConnectionWriter.java:186)
	... 5 more

However, before this it seems that messages couldn't be sent at all anymore:

java.util.concurrent.CompletionException: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.nats.client.support.NatsRequestCompletableFuture.cancelTimedOut(NatsRequestCompletableFuture.java:31)
    at io.nats.client.impl.NatsConnection.lambda$cleanResponses$6(NatsConnection.java:957)
    at java.base/java.util.concurrent.ConcurrentHashMap.forEach(Unknown Source)
    at io.nats.client.impl.NatsConnection.cleanResponses(NatsConnection.java:953)
    at io.nats.client.impl.NatsConnection$2.run(NatsConnection.java:455)
    at java.base/java.util.TimerThread.mainLoop(Unknown Source)
    at java.base/java.util.TimerThread.run(Unknown Source)
Caused by: java.util.concurrent.CancellationException: Future cancelled, response not registered in time, likely due to server disconnect.
... 7 common frames omitted

Not quite sure what the root cause is. It seems like something is going wrong even prior to the out-of-bounds.

MauriceVanVeen avatar Oct 12 '22 20:10 MauriceVanVeen

What can I do to reproduce this?

scottf avatar Oct 12 '22 20:10 scottf

Not sure how to reproduce it yet. Currently trying to dig through the logs to get more clarity.

Looking a bit more in the logs for a timeline:

  • UTC 19:36:57, messages start piling up
  • UTC 19:37:16, first sign of a CancellationException
  • UTC 19:43:11, first sign of an ArrayIndexOutOfBoundsException

MauriceVanVeen avatar Oct 12 '22 20:10 MauriceVanVeen

Our way of "reproducing" this so far was running it for a day or two until it eventually goes into this state, but that isn't really a reliable way :sweat:

MauriceVanVeen avatar Oct 12 '22 20:10 MauriceVanVeen

@igor-komar @michelwilson This issue has been fixed in PR https://github.com/nats-io/nats.java/pull/820 and released in 2.16.6. Thanks to @MauriceVanVeen! The core issue was that the lazy message size calculation was not thread-safe; the lazy calculation was replaced with calculation upon construction.
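For readers landing here later, an illustrative before/after sketch (class and field names are hypothetical, not the actual NatsMessage code):

// Before: lazily computed, unsynchronized -- the writer thread can read a
// stale or partially updated size while another thread is computing it,
// so the arraycopy length and the buffer sizing can disagree.
class LazyMessage {
    private long sizeInBytes = -1;
    long getSizeInBytes() {
        if (sizeInBytes < 0) {
            sizeInBytes = computeSize(); // racy first call
        }
        return sizeInBytes;
    }
    private long computeSize() {
        return 0; // stand-in for: protocol line + headers + payload length
    }
}

// After: computed once at construction and immutable -- every thread that
// reads the size sees the same, final value.
class EagerMessage {
    private final long sizeInBytes;
    EagerMessage(long sizeInBytes) {
        this.sizeInBytes = sizeInBytes;
    }
    long getSizeInBytes() {
        return sizeInBytes;
    }
}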

scottf avatar Jan 20 '23 10:01 scottf