cluster slots not rediscovered on scale-down cluster
Description
#slots.rediscover() is correctly called when a MOVED or ASK reply is received:
https://github.com/redis/node-redis/blob/6f79b49f731a1aaf57f42e16ea72774f062e57a1/packages/client/lib/cluster/index.ts#L271
https://github.com/redis/node-redis/blob/6f79b49f731a1aaf57f42e16ea72774f062e57a1/packages/client/lib/cluster/index.ts#L256
there is a problem / race condition when the cluster is scaled down (hash slots rebalanced away, node forgotten, node shut down):
#slots.rediscover() might never be called if the leaving node is never queried (after its slots migrated away) before it shuts down, so it never gets a chance to reply with MOVED and trigger a rediscover.
this results in connection closed errors, because node-redis cluster.#slots keeps a client active for an outdated cluster topology.
steps to reproduce:
- start a redis cluster
- create node-redis client and connect
- redis-cli reshard all slots away from a redis cluster node
- redis-cli del-node the now empty node, then shutdown
- node-redis client executes a command for a key that is sharded to the (now shut down) node
- result: connection closed error (a rough client-side sketch follows below)
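For reference, a rough sketch of the client side of the repro (ports 7000-7002 and 'somekey' are just placeholders; the resharding / del-node / shutdown steps happen out of band with redis-cli):

import { createCluster } from 'redis';

const cluster = createCluster({
  rootNodes: [
    { url: 'redis://127.0.0.1:7000' },
    { url: 'redis://127.0.0.1:7001' },
    { url: 'redis://127.0.0.1:7002' }
  ]
});
cluster.on('error', err => console.error('cluster error', err));

await cluster.connect();

// works while the topology is unchanged
console.log(await cluster.get('somekey'));

// ...now reshard all slots away from one node, del-node it and shut it down...

// if 'somekey' hashes to a slot the removed node used to own, this call fails
// with a connection closed style error instead of triggering #slots.rediscover(),
// because the removed node never got a chance to reply with MOVED
console.log(await cluster.get('somekey'));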
an option to fix this would be to provide a configurable slot refresh interval (ioredis does this too); I will send a pull request
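For comparison, this is the ioredis knob I mean (a minimal sketch; the startup node/port is just an example). The proposal would be an analogous option for node-redis:

import Redis from 'ioredis';

const cluster = new Redis.Cluster(
  [{ host: '127.0.0.1', port: 7000 }],
  {
    // re-fetch the slot map every 5s, even if no MOVED/ASK reply is ever received,
    // so the client heals after a node is resharded away and shut down
    slotsRefreshInterval: 5000
  }
);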
Node.js Version
No response
Redis Server Version
No response
Node Redis Version
No response
Platform
No response
Logs
No response
Adding to this a scenario I've found:
Following this guide https://redis.io/learn/operate/redis-at-scale/scalability/exercise-1
Initializing createCluster with six nodes in rootNodes, 7100-7105 (instead of the guide's 7000-7005; I have conflicting ports).
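Roughly what that setup looks like (a sketch; everything besides rootNodes is left at defaults):

import { createCluster } from 'redis';

// six root nodes on 7100-7105
const redisClient = createCluster({
  rootNodes: [7100, 7101, 7102, 7103, 7104, 7105].map(port => ({
    url: `redis://127.0.0.1:${port}`
  }))
});

redisClient.on('error', err => console.error('Redis disconnect.', err));

await redisClient.connect();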
Logging all masters and replicas every second or so (while trying to access a certain key) with:
console.log(
  'masters',
  redisClient.masters.map(node => node.port),
  'replicas',
  redisClient.replicas.map(node => node.port)
);
Started my program while 7101 was down - it did not appear in the list (I then restarted 7101, but the list was not updated) - this seems to be caused by the issue above
masters (3) [7104, 7105, 7102] replicas (2) [7100, 7103]
At some point I shut down 7105 which caused an error event and the reconnect strategy to trigger
Redis disconnect. connect ECONNREFUSED 127.0.0.1:7105
After restarting 7105, 7101 now suddenly appears (it became a master after 7105 shut down), but 7105 does NOT appear as a replica. In this scenario a MOVED error would NOT occur, so the list of replicas is never updated.
masters (3) [7104, 7101, 7102] replicas (2) [7100, 7103]
Then I followed the rest of the guide and added 7106 and 7107 to the cluster; while running cluster reshard in the background (which inevitably led to a MOVED error when accessing my key), both 7101 and 7105 suddenly appeared as a master and a replica respectively, but 7106 and 7107 each appear 3 times.
masters (6) [7106, 7104, 7106, 7101, 7106, 7102] replicas (6) [7107, 7100, 7107, 7105, 7107, 7103]
I then restarted the application, but the list still shows the duplication above (probably because 7106 and 7107 were not listed in rootNodes, which seems like a separate but still alarming issue); at least all actual nodes now appear.
If only replicas, and not masters, are added or removed (scale up OR down), no MOVED error would probably occur at all, leading to the issue above.
I think a periodic interval refresh/rediscover as suggested above is a good idea.
IMO a better solution would be to run "rediscover" proactively when a connection dies, rather than adding a "slots refresh interval" (which would solve the problem eventually, but still with some "downtime"). I'm not sure whether we should invest time into fixing it in v4 or just fix it in v5 (which should be released soon). Anyway, PRs are more than welcome ;)
@leibale do you know whether this was ever fixed in v5?
We got hit by this recently when our Elasticache cluster scaled down automatically after a brief usage peak. Has anyone found a workaround to force the rediscovery?
Any idea if something like this would work?
const reconnectStrategy = (retries, cause) => {
  // handle connection errors when nodes disappear from the cluster
  if (cause instanceof ConnectionTimeoutError) {
    if (this.client && this.client.isReady) {
      this.client._slots.rediscover();
    }
  }
  // ... default behaviour
};
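In case that doesn't work (since #slots is a private class field, it probably isn't reachable as _slots from the outside), the only workaround I can think of goes through the public API: when a command fails with a connection-type error, drop the cluster client and create a fresh one, since connect() appears to re-discover the topology from rootNodes. A rough sketch, not an official recipe - the error matching is a guess:

import { createCluster } from 'redis';

const rootNodes = [
  { url: 'redis://127.0.0.1:7000' },
  { url: 'redis://127.0.0.1:7001' },
  { url: 'redis://127.0.0.1:7002' }
];

async function freshCluster() {
  const c = createCluster({ rootNodes });
  c.on('error', err => console.error('cluster error', err));
  await c.connect(); // connect() discovers the current slot -> node mapping
  return c;
}

let cluster = await freshCluster();

// run a command; if it fails in a way that looks like a stale-topology
// connection error, rebuild the client (and therefore the slot map) and retry once
async function withRediscover(fn) {
  try {
    return await fn(cluster);
  } catch (err) {
    // crude heuristic - adjust to whatever your node-redis version actually throws
    if (/closed|ECONNREFUSED|timed? ?out/i.test(String(err))) {
      try { await cluster.disconnect(); } catch {}
      cluster = await freshCluster();
      return fn(cluster);
    }
    throw err;
  }
}

// usage
const value = await withRediscover(c => c.get('somekey'));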