Improve FDB reliability when localities are misconfigured and later corrected
If a storage server (SS) lacks a valid locality because of misconfiguration, replica selection can return a team that does not satisfy the replication policy. For example, in three_data_hall mode, if a server is not configured with a data_hall locality, selectReplicas() may create a server team whose size does not equal the replication factor.
Another issue is that an SS may be chosen as the preferred server and fed into selectReplicas() even though no valid team can ever include it. For example, in three_data_center mode, if a DC has only one server and that server is chosen as the must-have member of a team, selectReplicas() will not be able to create such a team, and addTeamsBest() may get stuck there.
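To illustrate the second problem, here is a minimal Python sketch of a pre-check that rejects preferred servers which can never anchor a valid team. It assumes a toy policy in which every team needs at least `per_dc` servers from the preferred server's DC. The function name `viable_anchors`, the `per_dc` parameter, and the address-to-DC map are all hypothetical and are not part of the FDB code base.

```python
from collections import Counter

def viable_anchors(server_dc, per_dc=2):
    """Return the servers that could serve as the must-have member of a team.

    server_dc maps a server address to its DC id (hypothetical structure).
    Under the assumed toy policy, a server is viable only if its DC holds
    at least per_dc servers in total, itself included.
    """
    counts = Counter(server_dc.values())
    return [addr for addr, dc in server_dc.items() if counts[dc] >= per_dc]

# "c" is the only server in dc2, so it can never anchor a team here.
servers = {"a": "dc1", "b": "dc1", "c": "dc2", "d": "dc3", "e": "dc3"}
anchors = viable_anchors(servers)
```

Filtering candidates this way before the team-building loop would let the loop skip a hopeless anchor instead of retrying it indefinitely.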
Although these problems arise only under misconfiguration, DD should protect itself against them as follows:
- If an SS does not have a valid locality configuration under a replication policy, it should not be used in building teams -- it should be treated as always unhealthy. A trace event should also be emitted to notify the system operator;
- We should add test cases in simulation to cover these situations: three_data_hall mode and three_data_center mode.
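The first proposal above could look roughly like the following Python sketch. It is illustrative only: the `REQUIRED_KEYS` table, the `StorageServer` class, the `ServerInvalidLocality` event name, and the locality key names are assumptions made for the example, not the actual DD implementation.

```python
from dataclasses import dataclass, field

# Assumed mapping from redundancy mode to the locality keys it depends on.
REQUIRED_KEYS = {
    "three_data_hall": ("data_hall", "zoneid"),
    "three_data_center": ("dcid", "zoneid"),
}

@dataclass
class StorageServer:
    address: str
    locality: dict = field(default_factory=dict)

def missing_locality_keys(server, mode):
    """List the locality keys the mode needs that are absent or empty."""
    return [k for k in REQUIRED_KEYS[mode] if not server.locality.get(k)]

def partition_by_locality(servers, mode, trace):
    """Split servers into (usable, excluded) for team building.

    Excluded servers are the ones to treat as always unhealthy; one
    hypothetical trace event per excluded server is appended to trace.
    """
    usable, excluded = [], []
    for s in servers:
        missing = missing_locality_keys(s, mode)
        if missing:
            trace.append({"Type": "ServerInvalidLocality",
                          "Address": s.address,
                          "MissingKeys": missing})
            excluded.append(s)
        else:
            usable.append(s)
    return usable, excluded

servers = [
    StorageServer("10.0.0.1:4500", {"data_hall": "h1", "zoneid": "z1"}),
    StorageServer("10.0.0.2:4500", {"zoneid": "z2"}),  # no data_hall set
]
events = []
usable, excluded = partition_by_locality(servers, "three_data_hall", events)
```

Only the `usable` list would be passed on to team building, and the recorded events give the operator the address of each misconfigured process.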
This turns out to be a bigger issue than originally described. It breaks down into the following items:
- [ ] Fixing DD's misconfigured locality issue;
- [ ] Adding the simulation test for the misconfigured locality issue;
- [ ] Fixing tLog's misconfigured locality issue;
- [ ] Reducing the latency of recovery from the misconfigured locality issue;
- [ ] Reporting the addresses of processes with misconfigured localities in fdbcli status;
---Long-term problem to solve---
- [ ] Ensure that misconfiguring a process will not make a live cluster unavailable, even temporarily. The items above only ensure that the cluster will eventually be alive; they do not guarantee that the cluster will not experience a blip.
I also think we should consider adding something to report the presence of processes with misconfigured locality in status. Otherwise it may not be obvious that something isn't quite right.
"DD:Defend DD from misconfigured locality of servers -- Part 2" (#2110) was closed because the work has been dropped for a while and won't make it into 7.0, but we should still consider resuming it for 7.1.