StackExchange.Redis "master,fail" state not handled correctly

Hello! We saw instances where a node was in the "master,fail" state, but SE.Redis kept trying to connect to it, ignoring that, although it is marked as master, it is in a failed state.

E.g. the cluster info command returned:

ecffa36f58a103c199314291a68a66406195da01 20.212.157.81:15014 master,fail - 1662670083503 1662670080944 78 connected
20c41293cb0d5e1781c532e783b415bd32c2fcf8 20.212.157.81:13015 myself,master - 0 1662670077000 77 connected 2048-2340 7510-7802 10240-10532 12970-13182

but the library kept trying to connect to the "master,fail" node.

Sep 08 '22 23:09 lolodi

What's the scenario here - e.g. is it a configured endpoint, or are we discovering it?

Sep 09 '22 01:09 NickCraver

This is on a clustered cache with discovery.

Sep 14 '22 17:09 lolodi

Gotcha - what would the expected behavior here be?

It is intentional that we try to connect to the node because we're told it exists and we're monitoring for the moment it comes back online (in the background). There's also the possibility of a cluster going split brained and we wouldn't know to talk to the winning half if we didn't observe this (corner case, we hope).

Sep 15 '22 18:09 NickCraver

I think my expectation in this situation, where one node is in 'master,fail' and the other (the one that is actually answering) is 'myself,master', would be to fail over to the one that says it's 'myself,master' and disconnect from the other one, especially since its status says 'fail'. If both nodes of a shard are reporting as master, but one is in fail state and the other is not, I would expect the library to connect to the one not in fail state.

Sep 15 '22 18:09 lolodi

Connections are not per-shard though, they are per-server, and each server has some number of shard responsibilities (which can also change on the fly). There can also be (and usually are) many masters in a cluster. We want to connect to what we're told is there as quickly as possible.

Overall though, this happens in the background and isn't meant to be noisy - what issue is it actually causing?

Sep 15 '22 18:09 NickCraver

We have had instances where the client kept the connection to the failed master for hours instead of switching over to the healthy one. Our understanding is that SE.Redis checks whether the current node is still master, but doesn't verify that it is also not in fail state. If a node dies suddenly, it might still be reported as "master,fail" and the client never tries to reconnect to a different one.
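
To illustrate the check we have in mind, here is a minimal Python sketch. It is not StackExchange.Redis code; the function names `parse_cluster_nodes` and `is_usable_master` are hypothetical, and it assumes the whitespace-separated node-list format shown in the output earlier in this thread. The idea is to treat a node as a usable master only if its flags include "master" and do not include "fail".

```python
# Illustrative sketch only: hypothetical helpers, not the library's implementation.

SAMPLE = """\
ecffa36f58a103c199314291a68a66406195da01 20.212.157.81:15014 master,fail - 1662670083503 1662670080944 78 connected
20c41293cb0d5e1781c532e783b415bd32c2fcf8 20.212.157.81:13015 myself,master - 0 1662670077000 77 connected 2048-2340 7510-7802 10240-10532 12970-13182
"""


def parse_cluster_nodes(text):
    """Parse node-list output into a list of dicts, one per node line."""
    nodes = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip malformed or truncated lines
        nodes.append({
            "id": parts[0],
            # Newer servers append "@cport" to the address; drop it if present.
            "endpoint": parts[1].split("@")[0],
            "flags": set(parts[2].split(",")),
            "link_state": parts[7],
            "slots": parts[8:],
        })
    return nodes


def is_usable_master(node):
    """A master counts as usable only if it is not flagged as failed."""
    flags = node["flags"]
    # "fail?" (possible failure) is a separate flag and is not excluded here.
    return "master" in flags and "fail" not in flags


usable = [n["endpoint"] for n in parse_cluster_nodes(SAMPLE) if is_usable_master(n)]
print(usable)
```

With the output from our cluster, only 20.212.157.81:13015 passes the check, which is the node we would expect the client to prefer.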

Sep 16 '22 00:09 lolodi

Got some time to look at it over break this week - agreed this isn't handled correctly and fixing in #2288 :)

Oct 28 '22 12:10 NickCraver