
Node keeps losing peers.

Open arkpar opened this issue 3 years ago • 5 comments

Start a new warp sync using the latest polkadot+substrate master with an empty db dir in GCP. The node keeps losing almost all peers every couple of minutes.

a@a-db:~/src/polkadot$ ./target/release/polkadot -d ~/db3 --sync=warp -l db=trace,sync=trace 2>warp.log
2022-09-08 12:45:31.522  INFO main sc_cli::runner: Parity Polkadot
2022-09-08 12:45:31.522  INFO main sc_cli::runner: ✌️  version 0.9.28-b2ff4c05ab3
2022-09-08 12:45:31.522  INFO main sc_cli::runner: ❤️  by Parity Technologies <[email protected]>, 2017-2022
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 📋 Chain specification: Polkadot
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 🏷  Node name: certain-bean-5721
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 👤 Role: FULL
2022-09-08 12:45:31.522  INFO main sc_cli::runner: 💾 Database: RocksDb at /home/arkadiy/db3/chains/polkadot/db/full
2022-09-08 12:45:31.522  INFO main sc_cli::runner: ⛓  Native runtime: polkadot-9280 (parity-polkadot-0.tx13.au0)
2022-09-08 12:45:35.022  INFO main sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0x29d0…4e17, header-hash: 0x91b1…90c3)
2022-09-08 12:45:35.040  INFO main afg: 👴 Loading GRANDPA authority set from genesis on what appears to be first startup.
2022-09-08 12:45:35.618  INFO main babe: 👶 Creating empty BABE epoch changes on what appears to be first startup.
2022-09-08 12:45:35.620  INFO main sub-libp2p: 🏷  Local node identity is: 12D3KooWL9R2Y3DuFdAueryD5nn6cL6BZs4RkMimC9cYkxyHE3Uh
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Operating system: linux
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 CPU architecture: x86_64
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Target environment: gnu
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 CPU cores: 8
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Memory: 16039MB
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Kernel: 4.19.0-21-cloud-amd64
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Linux distribution: Debian GNU/Linux 10 (buster)
2022-09-08 12:45:35.710  INFO main sc_sysinfo: 💻 Virtual machine: yes
2022-09-08 12:45:35.710  INFO main sc_service::builder: 📦 Highest known block at #0
2022-09-08 12:45:35.710  INFO tokio-runtime-worker substrate_prometheus_endpoint: 〽️ Prometheus exporter started at 127.0.0.1:9615
2022-09-08 12:45:35.713  INFO                 main sc_rpc_server: Running JSON-RPC HTTP server: addr=127.0.0.1:9933, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-09-08 12:45:35.714  INFO                 main sc_rpc_server: Running JSON-RPC WS server: addr=127.0.0.1:9944, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 CPU score: 609MB/s
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 Memory score: 4530MB/s
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 Disk score (seq. writes): 475MB/s
2022-09-08 12:45:35.714  INFO                 main sc_sysinfo: 🏁 Disk score (rand. writes): 204MB/s
2022-09-08 12:45:35.717  INFO tokio-runtime-worker libp2p_mdns::behaviour::iface: creating instance on iface 10.132.0.41
2022-09-08 12:45:38.867  INFO tokio-runtime-worker sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.78.133.145/tcp/30333/ws/p2p/12D3KooWL9R2Y3DuFdAueryD5nn6cL6BZs4RkMimC9cYkxyHE3Uh
2022-09-08 12:45:42.255  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading finality proofs, 15.97 Mib (19 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.7MiB/s ⬆ 26.9kiB/s
2022-09-08 12:45:49.053  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading finality proofs, 31.92 Mib (39 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.7MiB/s ⬆ 7.4kiB/s
2022-09-08 12:45:55.835  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading finality proofs, 47.89 Mib (45 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.7MiB/s ⬆ 16.7kiB/s
2022-09-08 12:45:55.836  WARN tokio-runtime-worker telemetry: ❌ Error while dialing /dns/telemetry.polkadot.io/tcp/443/x-parity-wss/%2Fsubmit%2F: Custom { kind: Other, error: Timeout }
2022-09-08 12:46:00.836  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 99.24 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 6.5MiB/s ⬆ 23.8kiB/s
2022-09-08 12:46:05.836  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 154.66 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 7.0MiB/s ⬆ 12.8kiB/s
2022-09-08 12:46:10.837  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 207.67 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 6.9MiB/s ⬆ 7.3kiB/s
2022-09-08 12:46:15.837  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 263.09 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 6.8MiB/s ⬆ 6.0kiB/s
2022-09-08 12:46:20.864  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 304.35 Mib (50 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 4.4MiB/s ⬆ 5.0kiB/s
2022-09-08 12:46:25.865  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 401.33 Mib (49 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 17.4MiB/s ⬆ 9.4kiB/s
2022-09-08 12:46:30.865  INFO tokio-runtime-worker substrate: ⏩ Warping, Downloading state, 500.22 Mib (38 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 17.7MiB/s ⬆ 10.9kiB/s
2022-09-08 12:46:35.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (37 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 13.8MiB/s ⬆ 5.5kiB/s
2022-09-08 12:46:40.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (29 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.6kiB/s ⬆ 2.5kiB/s
2022-09-08 12:46:45.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (28 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 23.9kiB/s ⬆ 15.6kiB/s
2022-09-08 12:46:50.866  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (31 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 3.0kiB/s ⬆ 2.9kiB/s
2022-09-08 12:46:55.867  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (28 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 16.8kiB/s ⬆ 10.3kiB/s
2022-09-08 12:47:00.867  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (36 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 3.4kiB/s ⬆ 3.8kiB/s
2022-09-08 12:47:04.229  WARN    async-std/runtime trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender
2022-09-08 12:47:05.868  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (36 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 1.4kiB/s ⬆ 1.8kiB/s
2022-09-08 12:47:10.868  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (37 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 1.8kiB/s ⬆ 1.4kiB/s
2022-09-08 12:47:19.066  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (38 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 1.4kiB/s ⬆ 1.3kiB/s
2022-09-08 12:47:25.257  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (39 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.0kiB/s ⬆ 1.8kiB/s
2022-09-08 12:47:28.702  INFO tokio-runtime-worker db: Removing block #BlockId::Hash(0x91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3)
2022-09-08 12:47:30.412  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (40 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 2.8kiB/s ⬆ 2.7kiB/s
2022-09-08 12:47:35.574  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (29 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 5.2kiB/s ⬆ 5.9kiB/s
2022-09-08 12:47:40.762  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (32 peers), best: #0 (0x91b1…90c3), finalized #0 (0x91b1…90c3), ⬇ 3.8kiB/s ⬆ 3.8kiB/s
2022-09-08 12:47:48.754  INFO tokio-runtime-worker substrate: ⏩ Warping, Importing state, 577.70 Mib (28 peers), best: #11958735 (0x7bef…21c4), finalized #11958735 (0x7bef…21c4), ⬇ 9.0kiB/s ⬆ 5.6kiB/s
2022-09-08 12:47:50.786  INFO tokio-runtime-worker sync: Warp sync is complete (577 MiB), restarting block sync.
2022-09-08 12:47:52.460  INFO tokio-runtime-worker substrate: ✨ Imported #11958757 (0x130e…291d)
2022-09-08 12:47:53.758  INFO tokio-runtime-worker substrate: ⏩ Block history, #3648 (7 peers), best: #11958757 (0x130e…291d), finalized #11958735 (0x7bef…21c4), ⬇ 442.0kiB/s ⬆ 9.0kiB/s
2022-09-08 12:47:54.771  INFO tokio-runtime-worker substrate: ✨ Imported #11958758 (0xe724…8fbd)
2022-09-08 12:47:58.759  INFO tokio-runtime-worker substrate: ⏩ Block history, #14656 (21 peers), best: #11958758 (0xe724…8fbd), finalized #11958754 (0x3e53…0381), ⬇ 838.2kiB/s ⬆ 31.2kiB/s
2022-09-08 12:48:00.759  INFO tokio-runtime-worker substrate: ✨ Imported #11958759 (0x30e1…423e)
2022-09-08 12:48:00.822  INFO tokio-runtime-worker substrate: ✨ Imported #11958759 (0x75b0…42c0)
2022-09-08 12:48:03.760  INFO tokio-runtime-worker substrate: ⏩ Block history, #21504 (3 peers), best: #11958759 (0x30e1…423e), finalized #11958756 (0x4f92…0927), ⬇ 570.2kiB/s ⬆ 11.1kiB/s
2022-09-08 12:48:06.662  INFO tokio-runtime-worker substrate: ✨ Imported #11958760 (0x5779…1a5f)
2022-09-08 12:48:08.761  INFO tokio-runtime-worker substrate: ⏩ Block history, #30848 (23 peers), best: #11958760 (0x5779…1a5f), finalized #11958757 (0x130e…291d), ⬇ 1.0MiB/s ⬆ 87.0kiB/s
2022-09-08 12:48:13.310  INFO tokio-runtime-worker substrate: ✨ Imported #11958761 (0x4829…083b)
2022-09-08 12:48:13.434  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #11958761,0x4829…083b to #11958761,0xcf72…6052, common ancestor #11958760,0x5779…1a5f
2022-09-08 12:48:13.435  INFO tokio-runtime-worker substrate: ✨ Imported #11958761 (0xcf72…6052)
2022-09-08 12:48:13.763  INFO tokio-runtime-worker substrate: ⏩ Block history, #39424 (2 peers), best: #11958761 (0xcf72…6052), finalized #11958758 (0xe724…8fbd), ⬇ 784.1kiB/s ⬆ 60.7kiB/s

Notice how the number of peers drops from 20+ to 2. Unfortunately, our network protocol does not include a reason for why other peers decide to disconnect. However, it looks like this happens right after we announce the latest blocks to other peers. It could be that something was added to the block announcement that broke network protocol compatibility, because of this code here: https://github.com/paritytech/substrate/blob/ded44948e2d5a398abcb4e342b0513cb690961bb/primitives/consensus/common/src/block_validation.rs#L89
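For illustration, here is a minimal, self-contained sketch of the behaviour that code path implies: a validator that knows nothing about extra announcement data treats any non-empty payload as a protocol mismatch and asks for the peer to be disconnected. The trait, enum, and struct names below are simplified stand-ins for illustration, not the actual sp_consensus API.

```rust
// Simplified sketch of a block-announce validator that disconnects peers
// whose announcements carry unexpected extra data. All types here are
// illustrative stand-ins, not the real Substrate definitions.

#[derive(Debug)]
enum Validation {
    /// Announcement accepted; `is_new_best` marks it as the new best block.
    Success { is_new_best: bool },
    /// Announcement rejected; `disconnect` asks the peer set to drop the peer.
    Failure { disconnect: bool },
}

trait BlockAnnounceValidator {
    fn validate(&self, header: &[u8], extra_data: &[u8]) -> Validation;
}

/// A "default" validator in the spirit of the linked code: it knows nothing
/// about extra announcement data, so any non-empty payload is treated as a
/// protocol mismatch and the sender is disconnected.
struct DefaultValidator;

impl BlockAnnounceValidator for DefaultValidator {
    fn validate(&self, _header: &[u8], extra_data: &[u8]) -> Validation {
        if extra_data.is_empty() {
            Validation::Success { is_new_best: false }
        } else {
            // If a newer node started attaching data to its announcements,
            // peers running this logic would drop it here.
            Validation::Failure { disconnect: true }
        }
    }
}

fn main() {
    let validator = DefaultValidator;
    println!("{:?}", validator.validate(b"header", b""));      // accepted
    println!("{:?}", validator.validate(b"header", b"extra")); // rejected, disconnect
}
```

Under that assumption, a node that started attaching data to its block announcements would be dropped by every peer still running the stricter default validation, which would match the sudden peer-count collapse in the log above.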

arkpar avatar Sep 08 '22 13:09 arkpar

@dmitry-markin and @altonen could you both please look into this?

bkchr avatar Sep 08 '22 14:09 bkchr

To share intermediate results: I was able to reproduce the issue on polkadot+substrate master, but also on polkadot-0.9.28 & polkadot-0.9.16. So it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to suspect that the issue might be caused by some external factor.

dmitry-markin avatar Sep 13 '22 12:09 dmitry-markin

So it looks like this is not something introduced by the recent changes. Also, a couple of times everything worked fine (on master), so I'm starting to suspect that the issue might be caused by some external factor.

I saw this quite a while ago on my PC as well, but thought it was normal: peers dropping from 30 to 10 or even fewer, and then slowly creeping back up.

ggwpez avatar Sep 13 '22 12:09 ggwpez

I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as https://github.com/paritytech/substrate/issues/12105, namely the keep-alive timeout.

nazar-pc avatar Sep 13 '22 12:09 nazar-pc

I didn't look into warping, but if it downloads significant chunks of data, it might have the same root cause as #12105, namely the keep-alive timeout.

I tried reducing the keep-alive timeout from 10 s to 1 s (locally) to see if it affects the peer count, and it looks like it doesn't. Overall, the peer-count jitter seems purely random: sometimes the peer count is stable at 20-30 peers, and sometimes it starts dropping to almost 0.
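For reference, here is a minimal sketch of the keep-alive mechanism being tuned in that experiment, assuming the usual shape of such a timeout (names, values, and structure are illustrative, not the actual sc-network/libp2p code): a connection with no recent activity is closed once the timeout elapses, which the remote side simply observes as a peer dropping away.

```rust
// Illustrative sketch of a connection keep-alive timeout, the mechanism
// being tweaked above (10 s vs 1 s). All names and values are hypothetical.

use std::time::{Duration, Instant};

struct Connection {
    last_activity: Instant,
    keep_alive_timeout: Duration,
}

impl Connection {
    fn new(keep_alive_timeout: Duration) -> Self {
        Self { last_activity: Instant::now(), keep_alive_timeout }
    }

    /// Called whenever a substream is opened or data flows on the connection.
    fn record_activity(&mut self) {
        self.last_activity = Instant::now();
    }

    /// If no activity happened within the timeout, the connection is closed,
    /// which the other node sees as a peer silently disappearing.
    fn should_close(&self) -> bool {
        self.last_activity.elapsed() > self.keep_alive_timeout
    }
}

fn main() {
    // The experiment above compares a 10-second and a 1-second timeout.
    let mut conn = Connection::new(Duration::from_secs(1));
    conn.record_activity(); // some traffic arrives...
    std::thread::sleep(Duration::from_millis(1100));
    // ...then nothing for longer than the timeout: the connection is dropped.
    assert!(conn.should_close());
    println!("idle connection would be closed after the keep-alive timeout");
}
```

If the root cause were the keep-alive timeout, shrinking it should have made the peer drops much more frequent; the fact that it did not is what makes the result above informative.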

dmitry-markin avatar Sep 13 '22 14:09 dmitry-markin