stellar-core

Harden leader election against unresponsive nodes and asymmetric quorums

Open bboston7 opened this issue 1 year ago • 0 comments

There are a couple of scenarios that can lead to elevated nomination timeouts:

  1. Slow or missing validators that do not respond when they are elected as the round leader, and
  2. Asymmetric quorums leading to disagreements about the leader.

One way to solve (1) is to track how responsive other validators are to nomination over some window of time, and to use a fast timeout whenever the elected leader is a node marked unresponsive. With a fast timeout, the network will elect 2+ leaders: the unresponsive node(s) and a responsive node. The tradeoff is that there may be multiple values on the network when the unresponsive node(s) start responding again. However, this will resolve itself once those nodes are marked responsive again.
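To make the idea concrete, here is a minimal, language-agnostic sketch in Python (not stellar-core code). It assumes a fixed-size sliding window of per-node outcomes and a simple response-rate threshold. The names `ResponsivenessTracker` and `nomination_timeout`, along with the window size, threshold, and timeout values, are all hypothetical and chosen only for illustration.

```python
from collections import deque


class ResponsivenessTracker:
    """Remembers, per node, whether it nominated in recent rounds it led.

    Hypothetical helper: window size and threshold are illustrative only.
    """

    def __init__(self, window=10, min_response_rate=0.5):
        self.window = window
        self.min_response_rate = min_response_rate
        self._history = {}  # node id -> deque of bools (True = responded)

    def record(self, node_id, responded):
        # deque(maxlen=...) drops the oldest outcome once the window is full.
        history = self._history.setdefault(node_id, deque(maxlen=self.window))
        history.append(responded)

    def is_responsive(self, node_id):
        history = self._history.get(node_id)
        if not history:
            # No observations yet: treat the node as responsive.
            return True
        return sum(history) / len(history) >= self.min_response_rate


def nomination_timeout(tracker, leader_id, normal_timeout=1.0, fast_timeout=0.1):
    """Pick the round timeout (seconds) based on the leader's track record.

    Both timeout values are placeholders, not stellar-core's real settings.
    """
    if tracker.is_responsive(leader_id):
        return normal_timeout
    return fast_timeout
```

Because the window is bounded, a node that starts responding again is marked responsive after enough successful rounds push the old failures out of the window, which is the self-healing behavior described above. Whether a response rate is the right signal, and how large the window should be, are open design questions.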

This solution may also indirectly help with situation (2): asymmetric quorums will lead to some nodes with differing quorum sets being marked unresponsive, so multiple nodes will win leader election and the probability of the network agreeing on a leader will increase. However, it's worth discussing whether a more direct solution exists here.

bboston7, Jan 22 '25 19:01