Improve FDB reliability when localities are misconfigured and later corrected

Open xumengpanda opened this issue 6 years ago • 3 comments

If a storage server (SS) does not have a valid locality due to misconfiguration, the replication policy can select replicas that do not satisfy the policy. For example, in three_data_hall mode, if a server is not configured with a data_hall locality, selectReplicas() may create a server team whose size does not equal the replication factor.
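To make the failure mode concrete, here is a minimal sketch of a team check. It is a toy model in Python, not FDB's implementation: it assumes a simplified policy of one replica per distinct data_hall, and `team_satisfies_policy` and the `Locality` type are hypothetical names.

```python
from typing import Dict, List

Locality = Dict[str, str]  # hypothetical: locality key -> value for one server


def team_satisfies_policy(team: List[Locality], key: str, replicas: int) -> bool:
    """Toy policy: exactly `replicas` servers, each in a different `key` group."""
    values = [server.get(key) for server in team]
    # Reject teams of the wrong size and teams with a server missing the key.
    if len(team) != replicas or None in values:
        return False
    return len(set(values)) == replicas


good = [{"data_hall": "h1"}, {"data_hall": "h2"}, {"data_hall": "h3"}]
short = good[:2]                           # smaller than the replication factor
unlabeled = good[:2] + [{"zoneid": "z9"}]  # third server has no data_hall
```

A check like this only detects a bad team after the fact; it does not say how selectReplicas() should avoid producing one.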

Another issue is that an SS may be chosen as the preferred server and fed into selectReplicas() even though no valid team can ever include it. For example, in three_data_center mode, if a DC has only one server and that server is chosen as the must-have member of a team, selectReplicas() will not be able to create such a team, and addTeamsBest() may get stuck there.
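One way to avoid getting stuck is to test up front whether a preferred server can anchor any team at all. The sketch below is a toy model, not FDB code: it assumes a simplified policy that needs `per_group` servers from each of `groups` datacenters, and the function name, the `dcid` key, and the default numbers are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List

Locality = Dict[str, str]  # hypothetical: locality key -> value for one server


def can_anchor_team(preferred: Locality, servers: List[Locality],
                    key: str = "dcid", per_group: int = 2, groups: int = 3) -> bool:
    """Return False when no valid team can contain `preferred` (toy policy)."""
    counts = Counter(s[key] for s in servers if key in s)
    own = preferred.get(key)
    # The preferred server's own group must be able to supply per_group servers.
    if own is None or counts[own] < per_group:
        return False
    # Enough groups overall must be able to supply per_group servers each.
    return sum(1 for n in counts.values() if n >= per_group) >= groups


# dc3 has a single server, so a team anchored there can never be completed.
cluster = ([{"dcid": "dc1"}] * 2 + [{"dcid": "dc2"}] * 2
           + [{"dcid": "dc3"}] + [{"dcid": "dc4"}] * 2)
```

With this cluster, `can_anchor_team({"dcid": "dc3"}, cluster)` is false, so a team builder could skip that server instead of retrying indefinitely.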

Although this only happens under misconfiguration, DD should protect itself against the problem as follows:

  1. If an SS does not have a valid locality configuration under the replication policy, it should not be used in building teams; it should be treated as always unhealthy. DD should also emit a trace event to notify the system operator;
  2. We should add test cases in simulation to cover these situations in both three_data_hall mode and three_data_center mode.
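Item 1 could be sketched as follows. This is a toy model only: Python's `logging` stands in for a trace event, the event name `ServerInvalidLocality` is hypothetical, and the address-to-locality mapping is an assumed input shape rather than DD's real data structure.

```python
import logging
from typing import Dict, Iterable, List, Tuple

log = logging.getLogger("dd_sketch")  # stand-in for a trace event, not FDB's tracing API


def partition_by_locality(servers: Dict[str, Dict[str, str]],
                          required_keys: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split server addresses into (usable, misconfigured) for team building."""
    required = list(required_keys)
    usable: List[str] = []
    misconfigured: List[str] = []
    for address, locality in servers.items():
        # A key that is absent or empty counts as missing.
        missing = [k for k in required if not locality.get(k)]
        if missing:
            # Hypothetical event name; the point is that the operator is notified.
            log.warning("ServerInvalidLocality address=%s missing=%s",
                        address, ",".join(missing))
            misconfigured.append(address)
        else:
            usable.append(address)
    return usable, misconfigured


servers = {
    "10.0.0.1:4500": {"data_hall": "h1"},
    "10.0.0.2:4500": {"zoneid": "z2"},
    "10.0.0.3:4500": {"data_hall": ""},
}
usable, bad = partition_by_locality(servers, ["data_hall"])
```

Only the `usable` addresses would be offered to the team builder; the `bad` ones stay excluded until their locality is corrected.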

xumengpanda, Sep 12 '19 23:09

This turns out to be a bigger issue than expected. It breaks down into the following items:

  • [ ] Fixing DD's misconfigured locality issue;
  • [ ] Adding the simulation test for the misconfigured locality issue;
  • [ ] Fixing tLog's misconfigured locality issue;
  • [ ] Reducing the latency of recovery from a misconfigured locality;
  • [ ] Reporting the addresses of processes with misconfigured localities in fdbcli status;

---Long-term problem to solve---

  • [ ] Ensure that misconfiguring a process cannot make a live cluster unavailable, even temporarily. The items above only ensure that the cluster eventually becomes available again; they do not guarantee that the cluster will not experience a blip.

xumengpanda, Sep 17 '19 19:09

I also think we should consider adding something to report the presence of processes with misconfigured locality in status. Otherwise it may not be obvious that something isn't quite right.
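As a rough illustration of what such a status entry might carry, the sketch below builds a message from a map of process address to missing locality keys. The field names and the message name are hypothetical and are not taken from the actual status schema.

```python
from typing import Dict, List, Optional


def locality_status_message(missing_by_address: Dict[str, List[str]]) -> Optional[dict]:
    """Build a status-style message, or None when nothing is misconfigured."""
    if not missing_by_address:
        return None
    addresses = sorted(missing_by_address)
    return {
        "name": "misconfigured_locality",  # hypothetical message name
        "description": f"{len(addresses)} process(es) lack required locality keys",
        "processes": [
            {"address": a, "missing": sorted(missing_by_address[a])}
            for a in addresses
        ],
    }


msg = locality_status_message({
    "10.0.0.3:4500": ["data_hall"],
    "10.0.0.2:4500": ["data_hall", "dcid"],
})
```

Returning nothing for a healthy cluster keeps the status output quiet unless an operator needs to act.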

ajbeamon, Sep 18 '19 15:09

DD:Defend DD from misconfigured locality of servers -- Part 2 #2110 was closed as the work has been dropped for a while and won't be making it into 7.0, but we should still consider resuming this work for 7.1.

alexmiller-apple, Jan 25 '20 01:01