Add mechanism to increase efficiency of reconnection process after upgrade
In 1.6 and onward, there is a case that results in a higher-than-optimal time or number of attempts to reestablish a connection. This was found in testing:
The minor potential issue is timing in general - the tests use fast block times and node-kill-timers, if I am not mistaken? If so, the backoff timers might be too long for the system to work reliably. What you may be running into is what I call the “upgrade problem”, since I expect the same scenario to potentially occur on an upgrade: when a node is killed/restarted, its node ID changes randomly and its existing connections time out. Here’s the scenario from Alice’s perspective, where we go from Alice > Cecil to Cecil > Alice:
- Alice loses connection to Cecil.
- Alice fails to reconnect immediately while Cecil boots.
- Eventually, within the reconnect delay, Cecil boots up.
- Alice is in Cecil’s known_addresses, so Cecil connects to Alice.
- An incoming connection is set up, all is good.
- Alice connects to Cecil, notices the low connection rank, blocks Cecil’s socket address (for outgoing) for 10 minutes.
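For illustration, here is a minimal sketch of how the rank-based tiebreak and the outgoing block in the last step might fit together; the `NodeId` representation, the function name, and the constant are mine for this example, not the actual implementation:

```rust
use std::time::Duration;

/// Illustrative node identifier; the real one is regenerated randomly on
/// every restart, which is what flips the connection rank.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct NodeId(u64);

/// Hypothetical stand-in for the 10-minute outgoing block mentioned above.
const OUTGOING_BLOCK: Duration = Duration::from_secs(10 * 60);

/// Only the higher-ranked peer keeps its outgoing connection; the other
/// side drops the redundant dial and blocks the address for outgoing use.
fn keep_outgoing(our_id: NodeId, their_id: NodeId) -> bool {
    our_id > their_id
}

fn main() {
    let alice = NodeId(3);
    let cecil = NodeId(9); // Cecil restarted and drew a higher id: Cecil > Alice.

    if keep_outgoing(alice, cecil) {
        println!("Alice keeps her outgoing connection to Cecil");
    } else {
        // The last step above: Alice blocks Cecil's socket address for
        // outgoing connections for roughly ten minutes.
        println!("Alice blocks Cecil's address for {:?}", OUTGOING_BLOCK);
    }
}
```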
This is the “good” case: since the connection rank inverted, we were able to establish a connection quickly. However, what if Cecil goes down again after 1 minute and the connection rank flips again, so that Alice > Cecil once more?
- Alice loses the incoming connection from Cecil. She does not care, since it was an inbound connection.
- Cecil reboots, establishes a connection to Alice due to known_addresses.
- Both notice the low rank of the connection, but Alice learns of Cecil’s address.
- However, this time around, Cecil’s address is still on the do-not-call list when Alice learns it (see the sketch after this list).
- For another 9 minutes, we cannot establish a connection between these two nodes.
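To make those last two steps concrete, here is a rough sketch of a do-not-call list with an expiry; the structure and names are assumptions for illustration, not the actual code, but they show why learning an address again does not help while the entry is still live:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// A minimal sketch of a do-not-call list keyed by socket address.
/// Entries expire after `permanent_error_backoff`; until then, even a
/// freshly learned address is not dialed.
struct DoNotCall {
    backoff: Duration,
    blocked_until: HashMap<SocketAddr, Instant>,
}

impl DoNotCall {
    fn new(backoff: Duration) -> Self {
        Self { backoff, blocked_until: HashMap::new() }
    }

    /// Called when an outgoing connection is dropped as redundant.
    fn block(&mut self, addr: SocketAddr) {
        self.blocked_until.insert(addr, Instant::now() + self.backoff);
    }

    /// Called when an address is learned (e.g. from an incoming
    /// connection). Returns whether we may dial it right now.
    fn may_dial(&self, addr: &SocketAddr) -> bool {
        match self.blocked_until.get(addr) {
            Some(until) => Instant::now() >= *until,
            None => true,
        }
    }
}

fn main() {
    let cecil: SocketAddr = "203.0.113.7:35000".parse().unwrap();
    let mut dnc = DoNotCall::new(Duration::from_secs(10 * 60));

    // In the scenario, Alice blocked Cecil's address about a minute ago.
    dnc.block(cecil);

    // Cecil restarts and reconnects; Alice re-learns the same address,
    // but the entry has ~9 minutes left, so no outgoing connection is made.
    assert!(!dnc.may_dial(&cecil));
}
```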
You can verify this by running the test again with permanent_error_backoff set to 1 or 5 seconds; I recommend doing that. Potential fixes:
- Remove entries from the do-not-call list if the addresses are given to us by Magic Mike. This is a potential security issue though, as we can never verify the connection between a NodeId and a SocketAddr. A clever attacker could cause a node to get itself firewalled from every good validator this way.
- Lower the permanent_error_backoff -- not great, since it increases reconnection spam.
- Persist the NodeId of a node across restarts (see the sketch after this list). This is likely the best approach and what we had 3 years ago, but it is more work.
- Implement the gossip-less address syncing that I have in mind, to avoid needing such long backoff timers.
- Ignore it, since it will only present itself in arbitrary testing situations (unless a node is in a crash loop).
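For the persistence option, a minimal sketch could look like the following, assuming a simple file-based store; the path, the raw 32-byte identity, and the use of the rand crate are placeholders, since the real identity would presumably be the node's key material:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sketch of fix #3: keep the node identity stable across restarts by
/// persisting it to disk instead of regenerating it on every boot.
fn load_or_create_node_id(path: &Path) -> io::Result<[u8; 32]> {
    if let Ok(bytes) = fs::read(path) {
        if bytes.len() == 32 {
            let mut id = [0u8; 32];
            id.copy_from_slice(&bytes);
            // Reuse the previous identity after a restart/upgrade, so the
            // connection rank does not flip and backoffs stay consistent.
            return Ok(id);
        }
    }
    // First boot (or unreadable file): create and persist a fresh identity.
    let id: [u8; 32] = rand::random();
    fs::write(path, id)?;
    Ok(id)
}

fn main() -> io::Result<()> {
    let id = load_or_create_node_id(Path::new("node_id.bin"))?;
    println!("node id: {:?}", id);
    Ok(())
}
```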
The Achilles' heel of the system is the fact that we need to connect to unverified socket addresses; the original system used to sign (SocketAddr, NodeId) pairs for that reason. However, I encourage you to at least test this hypothesis with permanent_error_backoff set to 1 second.
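To show what signed (SocketAddr, NodeId) pairs could look like, here is a rough sketch assuming ed25519-dalek 2.x (with its rand_core feature) and treating the public key itself as the NodeId; the crate choice and message layout are my assumptions, not the historical implementation:

```rust
use std::net::SocketAddr;

use ed25519_dalek::{Signer, SigningKey, Verifier, VerifyingKey};
use rand::rngs::OsRng;

/// Build the message that binds a socket address to a node identity.
/// In this sketch the NodeId is simply the Ed25519 public key.
fn announcement_bytes(addr: &SocketAddr, node_id: &VerifyingKey) -> Vec<u8> {
    let mut msg = addr.to_string().into_bytes();
    msg.extend_from_slice(node_id.as_bytes());
    msg
}

fn main() {
    // Cecil owns a long-lived identity key (see the persistence sketch above).
    let cecil_key = SigningKey::generate(&mut OsRng);
    let cecil_id: VerifyingKey = cecil_key.verifying_key();
    let cecil_addr: SocketAddr = "203.0.113.7:35000".parse().unwrap();

    // Cecil signs the (SocketAddr, NodeId) pair before announcing it.
    let signature = cecil_key.sign(&announcement_bytes(&cecil_addr, &cecil_id));

    // Alice (or anyone relaying the address) can check the binding before
    // dialing, so an attacker cannot attribute arbitrary addresses to Cecil.
    assert!(cecil_id
        .verify(&announcement_bytes(&cecil_addr, &cecil_id), &signature)
        .is_ok());
}
```

If such signatures were reintroduced, the security concern with the first fix above (clearing do-not-call entries for gossiped addresses) would presumably go away, since the NodeId-to-SocketAddr binding could then be verified before dialing.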