Single node down when joining and leaving will cause total outage
If you have 3 ingesters and a replication factor of 3, a rolling update produces this error:
at least 3 live ingesters required, could only find 2
This is because the distributor increases the size of the quorum required when it finds ingesters joining and leaving.
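To make the arithmetic concrete, here is a minimal sketch; it is not the actual Cortex code. It assumes the quorum is a simple majority (n/2 + 1) computed over the larger of the replication factor and the replica set size, and that the replica set is extended by one while an ingester is joining or leaving. The function name `check_quorum` and its parameters are hypothetical.

```python
def check_quorum(replication_factor: int, replica_set_size: int, healthy: int) -> int:
    """Return the number of successful writes required, or raise if it cannot be met."""
    # Assumption: the quorum is sized from whichever is larger, the configured
    # replication factor or the (possibly extended) replica set.
    effective = max(replication_factor, replica_set_size)
    min_success = effective // 2 + 1  # simple majority

    # Unhealthy ingesters still count toward the quorum size,
    # but cannot contribute a successful write.
    if healthy < min_success:
        raise RuntimeError(
            f"at least {min_success} live ingesters required, "
            f"could only find {healthy}"
        )
    return min_success
```

Under these assumptions, `check_quorum(3, 3, 2)` succeeds with a quorum of 2, so one unhealthy ingester is tolerated. Once the replica set grows to 4 and only 2 ingesters are healthy, `check_quorum(3, 4, 2)` raises the error quoted above.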
Strikes me there is something wrong here.
I'm facing the same problem
#1488
The same issue hits when you have more than 3 ingesters, one is LEAVING, and another one goes bad. Distributors will fail the entire request back to the sender with a 500 code because they can only find 2 ingesters for some subset of the series.
I think that if this happened with extended writes set to true, it was fixed by https://github.com/cortexproject/cortex/issues/4626
#4626 is another issue; did you mean #4636?