
Raft extensions for omission faults

Open jumaffre opened this issue 4 years ago • 5 comments

Our current implementation of CFT consensus still suffers from some limitations and liveness issues around omission faults (i.e. dropped messages). This is especially the case if one or several nodes are partitioned out of the service, as demonstrated by the end-to-end tests introduced in #2553: a single partitioned backup will automatically become a candidate if it has been partitioned for >= election_timeout.

Note that this is only true when no new write transactions are processed by the current leader. Otherwise, the partitioned node wouldn't be able to win an election as its last known seqno would be behind.

The following two extensions should help mitigate this family of issues:

1. PreVote

Each potential candidate should first check that a quorum of nodes would accept it as primary should it become a candidate. Only then should the node transition to the candidate state and request votes from the other nodes.

Other nodes should respond to PreVote messages as if it were a real election, but they don't need to keep track of which nodes they have granted their PreVote to. Only once a quorum of nodes has responded positively to the PreVote round can the node become a candidate.
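A minimal sketch of the PreVote round described above (all names here — `Node`, `handle_pre_vote`, `pre_vote_round` — are illustrative assumptions, not CCF's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    current_term: int = 0
    last_seqno: int = 0
    heard_from_leader_recently: bool = True

    def handle_pre_vote(self, candidate_term: int, candidate_last_seqno: int) -> bool:
        # Respond as in a real election, but grants are not recorded:
        # a node may grant PreVote to several prospective candidates.
        if self.heard_from_leader_recently:
            return False  # current leader still looks alive (leader stickiness)
        if candidate_term < self.current_term:
            return False
        if candidate_last_seqno < self.last_seqno:
            return False  # candidate's log is behind ours
        return True

def pre_vote_round(node: Node, peers: list[Node]) -> bool:
    """True iff a quorum (counting the node itself) would elect this node."""
    prospective_term = node.current_term + 1
    granted = 1 + sum(
        peer.handle_pre_vote(prospective_term, node.last_seqno) for peer in peers
    )
    quorum = (len(peers) + 1) // 2 + 1
    return granted >= quorum
```

Under this sketch, a backup that rejoins after a partition with a stale log fails the PreVote round and never bumps its term, so it doesn't disturb the current leader.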

2. Leader stickiness/CheckQuorum

The goal here is to make sure that a primary stays primary for as long as possible, i.e. doesn't step down just because a single node started an election.

Nodes should grant their PreVote/Votes if they haven't heard from a primary within their election timeout. As I understand it, this implies that a node should only grant PreVote/Votes when it is already in the new "campaign" (is this a good name for it?) or candidate state.

Moreover, a primary should actively step down (i.e. become a follower in the same term in which it was primary) if it hasn't heard AppendEntries responses from a majority of backups within the election timeout.

Note that this also impacts the "sunny day" election scenario, as the first half of the nodes whose election timeout expires wouldn't manage to get a quorum of nodes (because the others still know about the current leader and haven't yet timed out). This is also a positive change, as it would effectively average the election timeout of the service over a quorum of nodes rather than have it set by the single node with the smallest election timeout.
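The primary-side half of this (CheckQuorum) can be sketched as follows. This is an illustrative outline under assumed names (`Primary`, `on_append_entries_response`, `check_quorum`), not CCF's real classes: the primary records the last time each backup responded to AppendEntries, and steps down in the same term if fewer than a majority have responded within the election timeout.

```python
import time

class Primary:
    def __init__(self, node_id: str, backup_ids: list[str], election_timeout_s: float):
        self.node_id = node_id
        self.election_timeout_s = election_timeout_s
        self.is_primary = True
        now = time.monotonic()
        self.last_response = {b: now for b in backup_ids}

    def on_append_entries_response(self, backup_id: str) -> None:
        # An ACK or a NACK both prove the link to this backup is alive.
        self.last_response[backup_id] = time.monotonic()

    def check_quorum(self) -> None:
        """Called periodically; step down if a majority is unreachable."""
        now = time.monotonic()
        live_backups = sum(
            1 for t in self.last_response.values()
            if now - t < self.election_timeout_s
        )
        # The primary counts itself; cluster size = backups + 1.
        majority = (len(self.last_response) + 1) // 2 + 1
        if live_backups + 1 < majority:
            self.is_primary = False  # step down, keeping the current term
```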

Sources:

  • https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/
  • https://www.openlife.cc/sites/default/files/4-modifications-for-Raft-consensus.pdf

jumaffre avatar May 10 '21 15:05 jumaffre

This short workshop paper might be useful: https://dl.acm.org/doi/pdf/10.1145/3447851.3458739

heidihoward avatar Mar 21 '22 09:03 heidihoward

I agree that these might be worth considering. Essentially, liveness in Raft depends on establishing a stable leader who is able to make progress, who is not unduly forced to step down, and who can receive and reply to client requests.

Omission faults present 4 main issues for Raft:

  • No leader – repeated elections, but no single leader is elected. This could be due to vote splitting or to the extra condition on voting (the log up-to-date check).
  • Zombie leader – leader cannot make progress, for instance, because it is not connected to a majority quorum any longer.
  • Leader step down – a healthy leader is forced to step down. Could be due to nodes increasing their terms during a network partition which has now healed or an asymmetric partition where a node cannot hear from the leader but can force it to step down via a third node.
  • Clients cannot reach the current leader – either because they must connect to the leader directly and are partitioned from it, or, if followers are allowed to forward client requests, because the follower may be partitioned from the leader.

As we discuss in the blog post, you do have to be careful when modifying Raft for omission faults not to introduce new liveness issues elsewhere. PreVote and CheckQuorum are definitely worth considering, though they don't always cover the case where a link between nodes might drop arbitrary messages (a flaky link). For simplicity, the blog post assumes either an ideal link between nodes or no link at all.

heidihoward avatar Mar 22 '22 09:03 heidihoward

As discussed with @heidihoward, some points to take into consideration when implementing the CheckQuorum extension:

  • It has become apparent that having CheckQuorum would help operators monitor the health of a network (see https://github.com/microsoft/CCF/issues/3688).
  • It is OK to implement CheckQuorum in isolation/before implementing Pre-Vote (i.e. no apparent impact on liveness).
  • The primary node would step down as follower if it hasn't heard from a majority (i.e. f) of backup nodes for the duration of the election timeout (not randomised like on the backup nodes). Note that the stepping-down primary should become follower in the same term as it was previously primary in (i.e. no increase in term), while keeping track of the votes it has already granted in this term (i.e. for itself).
  • Note that without Pre-Vote, the primary node will soon become candidate in a new term if it doesn't hear from any other nodes (which is likely if this node was previously partitioned).
  • Primary nodes should treat ACKs and NACKs similarly. In other words, receiving a NACK is just as good as receiving an ACK from the CheckQuorum perspective.
  • We shouldn't assume that the latency of consensus messages between nodes is only due to network latency/partitions. We should also consider the time it takes for backup nodes to process a large batch of append entries. If a large batch is sent from the primary to a backup (e.g. because the backup has just joined with no snapshot), it is possible that the batch is so large that the primary considers this node unresponsive. We should have a look at the append entries batching logic (see for example https://github.com/microsoft/CCF/issues/1451) to make sure it is capped at a reasonable size (and perhaps surface it as a configuration entry to operators?).
  • The TLA+ model does not currently check liveness properties, which CheckQuorum very much tries to address. Testing this feature should thus make use of the existing Raft scenarios and partition end-to-end tests.
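On the batching point above, one way to keep a single AppendEntries batch from starving a backup's responses is to split pending entries at a byte cap. A minimal sketch, assuming a hypothetical `MAX_BATCH_BYTES` limit and `(seqno, size_bytes)` pending entries (neither is a real CCF configuration entry):

```python
# Illustrative cap; in practice this might be surfaced to operators.
MAX_BATCH_BYTES = 4 * 1024 * 1024

def batch_entries(entries: list[tuple[int, int]]) -> list[list[int]]:
    """Split pending (seqno, size_bytes) entries into size-capped batches."""
    batches: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for seqno, size in entries:
        # Flush the current batch if adding this entry would exceed the cap.
        if current and current_size + size > MAX_BATCH_BYTES:
            batches.append(current)
            current, current_size = [], 0
        current.append(seqno)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With a cap like this, a backup catching up from scratch receives several smaller batches and can respond between them, so the primary's CheckQuorum bookkeeping keeps seeing it as responsive.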

Also to explore: to reduce service unavailability, it may be nice for the primary node to tell the other nodes when it steps down, so that they don't wait for their election timeout to expire before trying to elect a new primary.

jumaffre avatar Apr 11 '22 14:04 jumaffre

Leader stickiness is interesting independently from PreVote, because it gives some defense against another node's clock running fast.

eddyashton avatar May 15 '23 09:05 eddyashton

Adding tla+ tag as it might be worth spec'ing this change out before implementing it

heidihoward avatar May 15 '23 12:05 heidihoward