Reconnect Strategy doesn't work if connection is lost
Description
I'm having some issues with the reconnect strategy in this library. It seems to work as per the documentation on the initial connection to Redis, however the behaviour changes if the connection to Redis is lost at a later stage.
The documentation states that there are three different types that can be returned:
- A
numbercan be returned to indicate that the client should retry after the specified delay - An error can be returned to terminate the connection
-
falsecan be returned to prevent reconnection and hold the client in some kind of broken state for testing
Behaviour on Initial Connection
- The
RedisSocket.connect()function is called - This calls the private function
RedisSocket.#connect()which deals with retries - There is a line in packages/client/lib/client/socket.ts:218 where the output of the reconnect strategy function is evaluated - if it's a number then a retry is scheduled; if it's not a number then it throws whatever it is.
- Assuming the value is not
false(which I assume is a noop in javascript) it bubbles up to the initial call toRedisSocket.connect()and the promise called by the library user rejects with the error
This is the behaviour that is documented.
Behaviour when connection is lost
- The socket
"error"event fires, which calls intoRedisSocket.#onSocketError() -
RedisSocket.#onSocketError()re-emits the error event and also emits"reconnecting"before callingthis.#connect()on line 269 - Crucially, that line functionally looks like this:
this.#connect().catch(() => {}));
Any error in connecting is completely ignored. This means that is has the same effect as if the reconnect strategy function returned false which (quoting the documentation here) "should only be used for testing purposes". This silently leaves the client in a broken state.
What I expect to happen
At the very least expect the "error" event to be emitted with the error returned by the reconnect strategy. Even better would be for a "terminated" or similar signal to be emitted to tell the library that the retry strategy failed and allow it to do it's own recovery.
Node.js Version
18.20.6
Redis Server Version
7.3.241
Node Redis Version
5.0.1
Platform
Linux
Logs
@edrose I dont see the issue here. Even though this.#connect().catch(() => {})) swallows the error, #connect will still emit an error event, which will in turn trigger #onSocketError, continuing in a loop until the reconnection strategy returns something different than a number
True, internally there is nothing wrong with the logic - it does exactly what it's told to by the reconnection strategy. However as a user of the library I need to be able to detect when the library has given up retrying, otherwise I just have a dead client.
Listening for 'error' events isn't sufficient because those don't tell me if the client has given up retrying or not. If it has given up then further logic is needed to re-initialise the client, maybe re-fetch some configuration or even do a full restart.
The context is that we have a webapp and Redis running in a Kubernetes cluster on AWS spot instances. Every so often we get spot interruptions that cause our Redis instances to move to different nodes. When this occurs they go down, then come back up with a new IP address. Since we have replication this can occur with no downtime, but it does cause error events to be emitted whenever it happens, even if that doesn't cause an interruption in service.
To detect and recover from a persistent error I need to be able to tell the difference between "an error occurred and I'm going to try to fix it" and "an error occurred and I can't do anything more".
@edrose AFAIK the only thing that controls when the library will choose to stop retrying is the reconnectStrategy, which is controlled by the user. So, you have to return false or Error from your custom reconnectStrategy. At that point, you could invoke any other custom logic you wish. Here is a minimal example:
import { GenericContainer } from 'testcontainers';
import { createClient } from 'redis';
const container = await new GenericContainer('redis')
.withExposedPorts(6379)
.start();
const makeCappedRetryStrategy = (maxRetries: number, cleanup: () => void) => {
return (retries: number, cause: Error) => {
console.log(`retrying... ${retries}`);
if (retries >= maxRetries) {
cleanup();
return false;
}
return 1000;
}
}
const client = createClient({
url: `redis://localhost:${container.getMappedPort(6379)}`,
socket: {
reconnectStrategy: makeCappedRetryStrategy(3, () => {
console.log('cleanup function called');
})
}
});
client.on('error', () => {});
await client.connect();
await container.exec('redis-cli shutdown');
In principal, Im not against firing a 'terminated' event, but fundamentally, the behavior is controlled by the reconnectStrategy.
@edrose what do you think?
I don't think that code works when .duplicate() is used. So if duplicate clients are created for pub/sub then the reconnect strategy will also duplicate, which duplicates whatever work is done when the reconnect strategy function fails. There isn't any way of knowing which client called the callback function, so extra care must be taken to ensure that extra logic isn't called multiple times.
I think the pattern of using the side-effects of a callback function is also just asking for bugs to occur, and I would never do extra work in the callback unless the documentation explicitly states that it's acceptable. As a user of the library I don't know what state the Redis client is in when it calls the callback, so I wouldn't assume that it's safe to modify the state of the client whilst in that callback context.
As a bare minimum I would be using process.nextTick() to do any extra logic in the callback if I was forced to, but I still think it would be much cleaner and safer if an event is emitted.
I see your point. I think its sensible to add the event. @bobymicroby do you see any issues for us emitting a 'terminated' event on client when reconnectStrategy returns false ( which is when we kill the socket and client is unusable from that point on ) ?
The code should work when duplicate is used as every client gets its own "retry" function instance. If it doesn't, it's a bug we missed, but from what I see there shouldn't be a problem. The retryStrategy function that the user supplies has no side effects and is pure. It receives the strategy(retries, cause) - retries so far and the latest error - so you are free to make informed decisions and to decide if you want to keep retrying or not based on that.
What we can do ( just thinking out loud, not sure if it makes sense ) is wrap the last error in RetryStrategyExhausted(originalError) before emitting it here: https://github.com/bobymicroby/node-redis/blob/065eb5e9145db9152923a44b63a356c55f76cf3b/packages/client/lib/client/socket.ts#L192. Unfortunately this can break clients which expect the raw error and will need more consideration.
It is also worth to note that while using redis cluster if the connection is lost rediscovery never happens, so we end up with a stale connection. Which is causing ClientClosedError: The client is closed. Would be nice to do something with it.