cluster slots not rediscovered on scale-down cluster
Description
#slots.rediscover() is correctly called when a MOVED or ASK reply is received:
https://github.com/redis/node-redis/blob/6f79b49f731a1aaf57f42e16ea72774f062e57a1/packages/client/lib/cluster/index.ts#L271
https://github.com/redis/node-redis/blob/6f79b49f731a1aaf57f42e16ea72774f062e57a1/packages/client/lib/cluster/index.ts#L256
there is a problem / race condition when the cluster is scaled down (hash slots rebalanced away, node forgotten, node shut down):
#slots.rediscover() might never be called if the leaving node is never queried (after its slots migrated away) before it shuts down, so it never gets a chance to reply with MOVED and trigger a rediscover.
this results in connection closed errors, because node-redis cluster.#slots keeps a client active for an outdated cluster topology.
steps to reproduce:
- start a redis cluster
- create node-redis client and connect
- redis-cli reshard all slots away from a redis cluster node
- redis-cli del-node the now empty node, then shutdown
- node-redis client executes a command for a key that is sharded to the (now shut down) node
- result: connection closed error (a rough client-side sketch follows below)
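For reference, a rough sketch of the client side of the repro (ports 7000-7002 and 'somekey' are just placeholders; the resharding / del-node / shutdown steps happen out of band with redis-cli):

import { createCluster } from 'redis';

const cluster = createCluster({
  rootNodes: [
    { url: 'redis://127.0.0.1:7000' },
    { url: 'redis://127.0.0.1:7001' },
    { url: 'redis://127.0.0.1:7002' }
  ]
});
cluster.on('error', err => console.error('cluster error', err));

await cluster.connect();

// works while the topology is unchanged
console.log(await cluster.get('somekey'));

// ...now reshard all slots away from one node, del-node it and shut it down...

// if 'somekey' hashes to a slot the removed node used to own, this call fails
// with a connection closed style error instead of triggering #slots.rediscover(),
// because the removed node never got a chance to reply with MOVED
console.log(await cluster.get('somekey'));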
an option to fix this would be to provide a configurable slot refresh interval (ioredis does this too); I will send a pull request
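For comparison, this is the ioredis knob I mean (a minimal sketch; the startup node/port is just an example). The proposal would be an analogous option for node-redis:

import Redis from 'ioredis';

const cluster = new Redis.Cluster(
  [{ host: '127.0.0.1', port: 7000 }],
  {
    // re-fetch the slot map every 5s, even if no MOVED/ASK reply is ever received,
    // so the client heals after a node is resharded away and shut down
    slotsRefreshInterval: 5000
  }
);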
Node.js Version
No response
Redis Server Version
No response
Node Redis Version
No response
Platform
No response
Logs
No response
Adding to this a scenario I've found:
Following this guide https://redis.io/learn/operate/redis-at-scale/scalability/exercise-1
Initializing createCluster with six nodes in rootNodes, 7100-7105 (instead of the guide's 7000-7005; I have conflicting ports).
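Roughly what that setup looks like (a sketch; everything besides rootNodes is left at defaults):

import { createCluster } from 'redis';

// six root nodes on 7100-7105
const redisClient = createCluster({
  rootNodes: [7100, 7101, 7102, 7103, 7104, 7105].map(port => ({
    url: `redis://127.0.0.1:${port}`
  }))
});

redisClient.on('error', err => console.error('Redis disconnect.', err));

await redisClient.connect();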
Logging all masters and replicas every second or so (while trying to access a certain key) with:
console.log(
  'masters',
  redisClient.masters.map(node => node.port),
  'replicas',
  redisClient.replicas.map(node => node.port)
);
Started my program while 7101 was down - it did not appear in the list (I then restarted 7101, but the list was not updated) - this seems to be caused by the issue above
masters (3) [7104, 7105, 7102] replicas (2) [7100, 7103]
At some point I shut down 7105 which caused an error event and the reconnect strategy to trigger
Redis disconnect. connect ECONNREFUSED 127.0.0.1:7105
After restarting 7105, 7101 now suddenly appears (it became a master after 7105 shut down), but 7105 does NOT appear as a replica. In this scenario a MOVED error would NOT occur, so the list of replicas is never updated.
masters (3) [7104, 7101, 7102] replicas (2) [7100, 7103]
Then I followed the rest of the guide and added 7106 and 7107 to the cluster; while running cluster reshard in the background (which inevitably led to a MOVED error when accessing my key), both 7101 and 7105 suddenly appeared as a master and a replica respectively, but 7106 and 7107 each appear 3 times.
masters (6) [7106, 7104, 7106, 7101, 7106, 7102] replicas (6) [7107, 7100, 7107, 7105, 7107, 7103]
I then restarted the application, but the list still shows the duplication above (probably because 7106 and 7107 were not listed in rootNodes, which seems like a separate but still alarming issue); at least all actual nodes now appear.
If only replicas, and not masters, are added or removed (scale up OR down), no MOVED error would probably occur at all, leading to the issue above.
I think a periodic interval refresh/rediscover as suggested above is a good idea.
IMO a better solution would be to run "rediscover" proactively when a connection dies, rather than adding a "slots refresh interval" (which would solve the problem eventually, but still with some "downtime"). I'm not sure whether we should invest time into fixing it in v4 or just fix it in v5 (which should be released soon). Anyway, PRs are more than welcome ;)
@leibale do you know whether this was ever fixed in v5?
We got hit by this recently when our Elasticache cluster scaled down automatically after a brief usage peak. Has anyone found a workaround to force the rediscovery?
Any idea if something like this would work?
const reconnectStrategy = (retries, cause) => {
  // handle connection errors when nodes disappear from the cluster
  if (cause instanceof ConnectionTimeoutError) {
    if (this.client && this.client.isReady) {
      this.client._slots.rediscover();
    }
  }
  // ... default behaviour
};
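In case that doesn't work (since #slots is a private class field, it probably isn't reachable as _slots from the outside), the only workaround I can think of goes through the public API: when a command fails with a connection-type error, drop the cluster client and create a fresh one, since connect() appears to re-discover the topology from rootNodes. A rough sketch, not an official recipe - the error matching is a guess:

import { createCluster } from 'redis';

const rootNodes = [
  { url: 'redis://127.0.0.1:7000' },
  { url: 'redis://127.0.0.1:7001' },
  { url: 'redis://127.0.0.1:7002' }
];

async function freshCluster() {
  const c = createCluster({ rootNodes });
  c.on('error', err => console.error('cluster error', err));
  await c.connect(); // connect() discovers the current slot -> node mapping
  return c;
}

let cluster = await freshCluster();

// run a command; if it fails in a way that looks like a stale-topology
// connection error, rebuild the client (and therefore the slot map) and retry once
async function withRediscover(fn) {
  try {
    return await fn(cluster);
  } catch (err) {
    // crude heuristic - adjust to whatever your node-redis version actually throws
    if (/closed|ECONNREFUSED|timed? ?out/i.test(String(err))) {
      try { await cluster.disconnect(); } catch {}
      cluster = await freshCluster();
      return fn(cluster);
    }
    throw err;
  }
}

// usage
const value = await withRediscover(c => c.get('somekey'));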