oncrpc4j icon indicating copy to clipboard operation
oncrpc4j copied to clipboard

RPC call times out immediately

Open jcalcote opened this issue 7 months ago • 4 comments

We're using oncrpc4j version 3.3.2 (which, from what I can see from diffing the releases, is functionally identical to 3.3.3 - the latest).

here's some log snippet from our product for context:

2025-06-16 19:18:07.641 +0000,1837242509882 {} DEBUG c.p.d.m.MobilityControl:318 [msm-message-consumer-2-0] msmId=2-1750099925069: objId=2-2562: COW: SET_INSTANCES_COW_SPLIT: calling SICS (async) with pdfsSessionId=2370338415347, ...

2025-06-16 19:18:07.641 +0000,1837242675698 {} WARN c.p.d.m.MobilityControl:318 [msm-message-consumer-2-0] msmId=2-1750099925069: objId=2-2562: COW: SET_INSTANCES_COW_SPLIT: SICS (async) call failed: did not get a response within 5 MINUTES

The important part to note here is that we're calling an async RPC endpoint with a timeout of 5 minutes in the first line at 19:18:07.641. Literally within the same millisecond, we're getting a TimeoutException on the failed handler from the oncrpc4j library indicating "did not get a response within 5 MINUTES". This exception message attached to the TimeoutException comes from rpc/ReplyQueue here:

    public void registerKey(int xid, SocketAddress addr, CompletionHandler<RpcReply, RpcTransport> callback, final long timeout, final TimeUnit timeoutUnits) throws EOFException {
        ScheduledFuture<?> scheduledTimeout = null;
        if (timeout > 0 && timeoutUnits != null) {
            scheduledTimeout = executorService.schedule(() -> {
                CompletionHandler<RpcReply, RpcTransport> handler = get(xid);
                if (handler != null) { //means we're 1st, no response yet
                    handler.failed(new TimeoutException("did not get a response within " + timeout + " " + timeoutUnits), null);
                }
            }, timeout, timeoutUnits);
        }
        _queue.put(xid, new PendingRequest(addr, callback, scheduledTimeout));
    }

I don't really know how this can be happening - it would imply a bug in the scheduler I think, which I find hard to imagine, but the proof is clear. The message "did not get a response within" is clearly coming from oncrpc4j ReplyQueue code, and the timeout is clearly 5 minutes, based on the content of the same message. Yet, we're failing immediately. It's like schedule is running the event on the same thread inline (though that is not proven here).

[update] We're building our code against the 3.3.2 version of oncrpc4j library on maven central - we compile with Java 21.

[update 2] I've analyzed our logs and found that there were hundreds of legitimate 5 MINUTE timeouts in this session. Only this anomalous zero-delay "5 MINUTE" timeout occurred during the session. Submissions and time outs that occurred on either side if this event seemed to work perfectly.

jcalcote avatar Jun 16 '25 21:06 jcalcote

Hi @jcalcote, Thanks for the report. Indeed, the issue looks quite strange.

If I understand you correctly, this is a one-off shot, right? By looking at the code, it should be impossible 🤷‍♂ . Can you tell how many concurrent requests have been sent?

kofemann avatar Jun 18 '25 11:06 kofemann

Hi @kofemann - your assessment is correct - it's a one-off. Not happening regularly. I also got more information today - the receiver component (not mine) did not receive a request either. I would wonder if this is actually the root cause and the timeout is a red herring - except - that, as far as I can discern, the specific text of the timeout is only sent in a TimeoutException by the ReplyQueue scheduler when an actual timeout occurs. However, that said, I still think it's interesting to note that the server never received the request - I just don't know how related it is.

I can't say how many requests were outstanding - very likely a LOT - but we don't have metrics to track that (yet). My manager (who is pretty technical) is asking me to work around this so it never negatively impacts our code. His suggestion was that I look at wall time and compare with submission time (which I don't even store right now), and if the timeout has not fully elapsed, send the request again. I guess I'll have to do something like this if it's possible for oncrpc4j to fail early like this.

I agree with you - it seems not possible, and yet it did happen, and it's proven (by the messaging) that it was oncrpc4j that threw the exception. If you dig into it, I'll be happy to hear the results. Otherwise, I'm commanded to work around the possible failure.

[update] - I just looked again at the oncrpc4j ReplyQueue code and I now think I understand why the message was never sent. It's because the very first thing the timeout handler does is call get(xid) to retrieve the completion handler, and get(xid) removes the handler from the request queue before it's picked up and sent.

In fact, the timing is amazing really -

  1. scheduled a timeout
  2. added it to the request queue
  3. timeout fired: failed the request and removed it from the queue

The timeout had to be scheduled and the request added to the queue before the timeout fired, but all that happened (according to the log4j timestamp) within the same millisecond. (I'd be more inclined to believe the log4j timestamps are wrong - meaning, the log message queueing made it appear this all happened in the same millisecond.)

jcalcote avatar Jun 18 '25 21:06 jcalcote

I've come up with a theory that fits the facts. The person running the failed test uses a laptop and runs the test code in a vmware client or virtual box vm. He opens and closes his laptop all the time (as his daily schedule demands), suspending and waking up the process. It's possible therefore that this issue occurred due to a wall-time disjunction. Not sure I can convince him that this can't happen in a data center running on stable server quality equipment.

jcalcote avatar Jun 19 '25 15:06 jcalcote

Thanks for the update. I was speculating that the cause could be GC under memory pressure, however, the laptop in sleep mode sounds more realistic. Probably this can be tested. If thee is a loop that submits requests that runs on a laptop or vm, and then suspend the vm.

We run a production high-load NFS server, and the client code is used rarely. Only for pNFS layout recalls. Nonetheless, I have my doubts that ScheduledExecutor or related ReplyQueue section misbehaves.

kofemann avatar Jun 19 '25 16:06 kofemann