Add tests and understand implications of rollback when new node is joining
This has been discussed a few times now (see e.g. https://github.com/microsoft/CCF/issues/1014).
We should add some tests for rollback around the time a new node joins the service (for both 1tx and 2tx reconfiguration):
- When the initial addition of the new node in state `Pending` is rolled back
- When the proposal to trust the new node is rolled back
I believe the new node will poll `POST /node/join` until it has seen itself as `Trusted` (not necessarily committed). However, if its transition to `Trusted` is rolled back after it has observed it (i.e. it's already catching up), the new node will be stuck and, if it has observed itself in the store, will start triggering elections in a loop.
1. When the addition of a node added as `Pending` is rolled back, the new node will automatically re-add itself as `Pending` since it is the very same `POST /node/join` endpoint that is polled, so we should be fine here.
2. When the new node has already been transitioned to `Trusted` by the members, has already retrieved the ledger secrets and has become part of the consensus, it currently has no way to find out that it's no longer part of the consensus. The new node will stop receiving append entries from the current leader and, if it has caught up enough, it may even try to stage an election when this happens.

To cater for case 2, I see multiple options:
- The `POST /node/join` endpoint only reads (globally) committed state, although I think this is awkward as we only make use of the `get_globally_committed()` KV API in one location currently.
- The new node should wait until the `TxID` it receives in the response of its `POST /node/join` endpoint (once it's become `Trusted`) has become committed before initialising its consensus.
I believe the latter option is better as it puts this extra work on the new joiner rather than on the existing service, and it is also identical to how clients verify that their transaction was committed.
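For illustration, the second option amounts to the joiner polling the status of the `TxID` returned by the join response before it initialises consensus. The following is only a minimal sketch: `wait_until_committed`, the injected `get_tx_status` callable and the timeout values are hypothetical, and the status strings (`"Pending"`, `"Committed"`, `"Invalid"`) are assumed to match what the service reports for a transaction.

```python
import time
from typing import Callable


def wait_until_committed(
    get_tx_status: Callable[[str], str],  # hypothetical: returns the status of a TxID
    tx_id: str,
    timeout_s: float = 3.0,
    poll_interval_s: float = 0.1,
) -> bool:
    """Return True once tx_id is reported as committed.

    Return False if the transaction is reported as invalid (i.e. it was
    rolled back) or if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_tx_status(tx_id)
        if status == "Committed":
            return True
        if status == "Invalid":
            # The transition to Trusted was rolled back: do not start consensus.
            return False
        time.sleep(poll_interval_s)
    return False
```

The joiner would only initialise its consensus when this returns `True`, and would go back to polling the join endpoint otherwise.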
After trying to implement this, it turns out that the two options described above are flawed because the new joiner node may be required to commit the transaction that adds it as trusted (e.g. adding a new joiner node to a one-node network).
Alternatively, the new node could check that, once it has seen itself in the configuration (i.e. it has started to "tick"), the TxID at which it observed its addition is committed within an election timeout. If the rollback happens before the node has seen itself in the configuration, then it should attempt to join again within an election timeout.
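As a rough sketch of this alternative, the joiner's decision could be expressed as a small pure function that is evaluated on every tick. Everything here is hypothetical (`Action`, `next_action` and its parameters are illustrative names, not existing code), and the choice to re-join once the election timeout expires without a commit is my assumption about what the check should lead to.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"        # keep waiting for the addition to be committed
    REJOIN = "rejoin"    # poll the join endpoint again
    PROCEED = "proceed"  # addition is committed, participate normally


def next_action(
    seen_in_config: bool,
    addition_committed: bool,
    elapsed_s: float,
    election_timeout_s: float,
) -> Action:
    """Decide what a joining node should do.

    elapsed_s is the time since the node observed its own addition
    (only meaningful when seen_in_config is True).
    """
    if not seen_in_config:
        # Rollback happened before the node saw itself in the configuration:
        # attempt to join again.
        return Action.REJOIN
    if addition_committed:
        return Action.PROCEED
    if elapsed_s < election_timeout_s:
        return Action.WAIT
    # Assumption: no commit within an election timeout is treated as a
    # rollback, so the node re-joins instead of triggering elections.
    return Action.REJOIN
```

Keeping the decision separate from the timers and the networking should also make the rollback cases listed above straightforward to unit test.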