Connections are dropped, even if peers are added to reserved nodes.
Context: we are AlephZero, and we develop a Substrate node with our own finality gadget, AlephBFT. We run with a low block time (1 s) and with AURA; otherwise our spec is fairly standard. We currently depend on the polkadot-v0.9.13 branch.
We wrote our own network layer on top of the Substrate network to ensure direct connections between validators. Our network has two protocols/peersets: Generic and Validator. The first allows communication and gossip between any two nodes, regardless of their role. If nodes authenticate themselves over the Generic protocol as validators for some session, they are added to the Validator peerset as reserved nodes.
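To make the two-peerset flow concrete, here is a minimal sketch in plain Rust. It is only a toy model of the bookkeeping described above, not our actual code and not the Substrate API: `Peersets`, `on_peer_discovered`, `on_authenticated`, and the integer `PeerId`/`SessionId` types are all hypothetical names chosen for illustration.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for real peer and session identifiers.
type PeerId = u32;
type SessionId = u32;

#[derive(Default)]
struct Peersets {
    // Generic peerset: every known node, regardless of role.
    generic: HashSet<PeerId>,
    // Validator peerset: reserved peers, keyed by the session they authenticated for.
    validator: HashMap<SessionId, HashSet<PeerId>>,
}

impl Peersets {
    // Any discovered node joins the Generic peerset.
    fn on_peer_discovered(&mut self, peer: PeerId) {
        self.generic.insert(peer);
    }

    // Authentication happens over Generic, so a peer that is not in Generic
    // can never be promoted. Returns true if the peer was newly added to the
    // Validator peerset for this session.
    fn on_authenticated(&mut self, peer: PeerId, session: SessionId) -> bool {
        if !self.generic.contains(&peer) {
            return false;
        }
        self.validator.entry(session).or_default().insert(peer)
    }

    // Number of validator peers known for a session.
    fn validator_peers(&self, session: SessionId) -> usize {
        self.validator.get(&session).map_or(0, |peers| peers.len())
    }
}
```

The point of the sketch is the dependency between the two peersets: promotion to Validator is only reachable through a working Generic connection.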
Recently, while testing with 104 nodes (100 non-validators and 4 validators), we discovered that even though we always add every possible peer as reserved, some of them cannot keep their connections. On the other hand, block syncing works, which shows that the default Substrate protocol is working.
I'm attaching two greps of validator logs (the grep for every other peer is either healthy or looks exactly like one of these two):
The logs for 1 show that we try to connect to such nodes about 50 times. The logs for 2 are the weird ones: it looks like we connected to some peers, only to disconnect and never retry. These conclusions are based on lines from the “sub-libp2p” target.
Since validators cannot connect to each other in the Generic peerset, they never join the Validator peerset, so finalization never starts.
When we discovered this issue, we tried changing our logic a bit. We stopped adding every node as reserved to Generic and started doing this only for nodes that manage to authenticate themselves as validators (in that case we add them as reserved to both peersets). But the problem got worse: in this version no one can send any message to anyone, since every call to sc_network::service::network.notification_sender(peer, protocol), for any peer and any protocol, results in an error.
- Are there any restrictions on the number of reserved peers or conditions for adding them?
- What could be the reason why our protocol doesn’t receive messages but the default network does, i.e. blocks are syncing?
- Is it possible to create a notification_sender for a non-reserved peer?
CC @tomaka
There is an issue in Substrate master, but I don't think it concerns your branch: https://github.com/paritytech/polkadot-sdk/issues/533
The logs showing Rejected tend to indicate that the peer initiating the connection isn't locally marked as reserved. You are probably already aware of this, but peers need to be marked as reserved by both sides.
As soon as we get Event::SyncConnected, we add the peer to reserved. Is it possible that, if one node does it faster than the other, the connection won't be established?
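The race being asked about can be illustrated with a toy model in plain Rust. This is not Substrate's peerset: it assumes a peerset that only accepts incoming connections from peers it has already marked as reserved, and `Node`, `Outcome`, and `try_connect` are hypothetical names. Whether the real implementation retries after such a rejection is not something this sketch settles.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a real peer identifier.
type PeerId = u32;

struct Node {
    id: PeerId,
    reserved: HashSet<PeerId>,
}

impl Node {
    fn new(id: PeerId) -> Self {
        Node { id, reserved: HashSet::new() }
    }

    fn add_reserved(&mut self, peer: PeerId) {
        self.reserved.insert(peer);
    }

    // Assumption: an incoming connection is accepted only from a peer
    // that this node has itself marked as reserved.
    fn accepts(&self, initiator: PeerId) -> bool {
        self.reserved.contains(&initiator)
    }
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Connected,
    Rejected,
    NotAttempted,
}

// A node only dials peers it has marked as reserved; the target then
// decides whether to accept based on its own reserved set.
fn try_connect(initiator: &Node, target: &Node) -> Outcome {
    if !initiator.reserved.contains(&target.id) {
        return Outcome::NotAttempted;
    }
    if target.accepts(initiator.id) {
        Outcome::Connected
    } else {
        Outcome::Rejected
    }
}
```

In this model, if node A marks B as reserved before B has marked A, A's first attempt is rejected, and a later attempt succeeds only once B has also added A. So under these assumptions the outcome depends on whether the initiating side tries again after the slower side catches up.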
It turned out our problems were mostly caused by running on machines that were too weak. We are still unsure about some of the network abstractions, but this specific issue can be considered fixed.