
Add tests and understand implications of rollback when new node is joining

Open jumaffre opened this issue 4 years ago • 2 comments

This has been discussed a few times now (see e.g. https://github.com/microsoft/CCF/issues/1014).

We should add some tests for rollback around the time a new node joins the service (for both 1tx and 2tx reconfiguration):

  1. When the initial addition of the new node in state Pending is rolled back
  2. When the proposal to trust the new node is rolled back

I believe the new node polls POST /node/join until it has seen itself as Trusted (not necessarily committed). However, if its transition to Trusted is rolled back after the node has observed it (i.e. it's already catching up), the new node will be stuck and, if it has observed itself in the store, will start triggering elections in a loop.
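To make the failure mode concrete, here is a toy model of it. This is not CCF code: `ToyLedger` and its methods are hypothetical names, and the ledger is reduced to a list of node-status writes plus a commit index. It only illustrates that a status read from uncommitted state can disappear on rollback.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLedger:
    """Hypothetical toy ledger: node-status writes plus a commit index."""

    entries: list = field(default_factory=list)
    commit_seqno: int = 0  # the first commit_seqno entries cannot be rolled back

    def append(self, node_id, status):
        self.entries.append((node_id, status))
        return len(self.entries)  # 1-based seqno of the new entry

    def commit(self, seqno):
        self.commit_seqno = max(self.commit_seqno, min(seqno, len(self.entries)))

    def rollback(self):
        """Drop every uncommitted entry."""
        del self.entries[self.commit_seqno:]

    def status_of(self, node_id):
        """Latest status, including uncommitted writes (what the joiner observes)."""
        for nid, status in reversed(self.entries):
            if nid == node_id:
                return status
        return None


ledger = ToyLedger()
ledger.commit(ledger.append("n1", "Pending"))  # Pending is committed
ledger.append("n1", "Trusted")                 # Trusted is NOT committed yet

observed = ledger.status_of("n1")        # joiner sees Trusted and starts catching up
ledger.rollback()                        # uncommitted suffix is discarded
after_rollback = ledger.status_of("n1")  # the service only knows the node as Pending
```

In this model the joiner has acted on `observed` while the service has reverted to `after_rollback`, which is the stuck state described above.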

jumaffre, Nov 25 '21 15:11

  1. When the addition of a node added as Pending is rolled back, the new node will automatically re-add itself as Pending since it is the very same POST /node/join endpoint that is polled, so we should be fine here.
  2. When the new node has already been transitioned to Trusted by the members, has retrieved the ledger secrets and has become part of the consensus, it currently has no way to find out that it is no longer part of the consensus. The new node will stop receiving append entries from the current leader and, if it has caught up enough, it may even try to stage an election when this happens.

To cater for 2., I see multiple options:

  • The POST /node/join endpoint only reads (globally) committed state, although I think this is awkward as we only make use of the get_globally_committed() KV API in one location currently.
  • The new node should wait until the TxID it receives in the response of its POST /node/join endpoint (once it's become Trusted) has become committed before initialising its consensus.

I believe the latter option is better as it puts this extra work on the new joiner rather than on the existing service, and it is also identical to how clients verify that their transaction was committed.
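A minimal sketch of the second option, written the way a client would wait for commit. Everything here is an assumption for illustration: `wait_for_commit` and `get_status` are hypothetical names, `get_status` stands in for whatever the joiner would use to query the status of a TxID, and the status strings ("Pending", "Committed", "Invalid") are assumed to match what the service reports.

```python
import time


def wait_for_commit(get_status, tx_id, timeout_s=3.0, poll_interval_s=0.1,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(tx_id) until the transaction is committed.

    Returns True once the status is "Committed", and False if the
    transaction is reported "Invalid" (rolled back) or the timeout expires.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status(tx_id)
        if status == "Committed":
            return True
        if status == "Invalid":
            return False
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Usage with a fake status source; "2.15" is a made-up TxID.
statuses = iter(["Pending", "Pending", "Committed"])
ready = wait_for_commit(lambda tx: next(statuses), "2.15", sleep=lambda _: None)
if ready:
    pass  # only now would the joiner initialise its consensus
```

The `sleep` and `clock` parameters are injected only so the sketch can be exercised without real delays.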

jumaffre, Jun 06 '22 14:06

After trying to implement this, it turns out that the two options described above are flawed because the new joiner node may be required to commit the transaction that adds it as trusted (e.g. adding a new joiner node to a one-node network).
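The one-node example can be checked with a small quorum calculation. This sketch assumes a plain majority quorum over the new configuration, as in Raft-style consensus; `majority` and `can_commit_without` are hypothetical helpers, not CCF functions.

```python
def majority(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def can_commit_without(config, absent):
    """Can `config` reach a majority if the nodes in `absent` do not participate?"""
    available = len(set(config) - set(absent))
    return available >= majority(len(config))


# One-node network {n0} adding joiner n1: the new configuration needs 2 of 2.
one_plus_joiner = {"n0", "n1"}
blocked = not can_commit_without(one_plus_joiner, {"n1"})

# Three-node network adding joiner n3: the existing nodes suffice (3 of 4).
three_plus_joiner = {"n0", "n1", "n2", "n3"}
fine = can_commit_without(three_plus_joiner, {"n3"})
```

Under that assumption, a joiner that refuses to participate until its own addition is committed deadlocks the one-node case, because its acknowledgement is needed for that commit.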

Alternatively, the new node could check that, once it has seen itself in the configuration (i.e. it has started to "tick"), the TxID at which it observed its addition is committed within an election timeout. If the rollback happens before the node has seen itself in the configuration, then it should attempt to join again within an election timeout.
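One way to read this proposal is as a pure decision function evaluated on each tick of the joiner. The names below (`Action`, `joiner_action`, and all parameters) are hypothetical, and re-joining when the commit deadline is missed is my assumption; the text above only says the node should check for commit within an election timeout.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"                # still within the election timeout
    PARTICIPATE = "participate"  # addition is committed, carry on normally
    REJOIN = "rejoin"            # assume a rollback and join again


def joiner_action(now, trusted_at, seen_in_config_at, addition_committed,
                  election_timeout):
    """Decide what a joiner should do at time `now`.

    trusted_at: when the join response first reported Trusted.
    seen_in_config_at: when the node saw itself in the configuration
        (started to "tick"), or None if it has not yet.
    addition_committed: whether the TxID of the addition is committed.
    """
    if addition_committed:
        return Action.PARTICIPATE
    # The timer restarts once the node has seen itself in the configuration.
    start = seen_in_config_at if seen_in_config_at is not None else trusted_at
    if now - start < election_timeout:
        return Action.WAIT
    return Action.REJOIN


# Rolled back before the node saw itself in the configuration:
early = joiner_action(now=6.0, trusted_at=0.0, seen_in_config_at=None,
                      addition_committed=False, election_timeout=5.0)
# Saw itself at t=4.0, but the addition never committed:
late = joiner_action(now=9.5, trusted_at=0.0, seen_in_config_at=4.0,
                     addition_committed=False, election_timeout=5.0)
```

Both `early` and `late` resolve to a re-join here, which keeps the recovery path identical for the two rollback cases.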

jumaffre, Jun 07 '22 16:06