Bad logging / behavior during backwards sync
Description
If Besu cannot find peers during backwards sync, it indefinitely logs "backwards sync failed, retrying...". A clearer log stating that we are waiting for peers would be more useful than stating that the sync failed.
Acceptance Criteria
- Backwards sync logs, after the first failure, that it is waiting for peers
- OR Besu waits for peers before beginning a BWS session (see the sketch after this list)
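A minimal sketch of the second criterion, assuming a peer-count accessor and a retry scheduler (hypothetical names, not Besu's actual API). Either variant would replace the misleading "session failed" message with an honest "waiting for peers" message while no peers are connected:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Hedged sketch: do not start a backward sync (BWS) session until at
// least one peer is connected. peerCount stands in for Besu's real peer
// table and is an assumption, not Besu's API.
final class BackwardSyncStarter {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final IntSupplier peerCount;

  BackwardSyncStarter(final IntSupplier peerCount) {
    this.peerCount = peerCount;
  }

  void startSession(final Runnable session) {
    if (peerCount.getAsInt() == 0) {
      // Clearer than "Current backward sync session failed, it will be
      // restarted": nothing has failed yet, we simply have no peers.
      System.out.println("Backward sync is waiting for peers");
      scheduler.schedule(() -> startSession(session), 5, TimeUnit.SECONDS);
      return;
    }
    System.out.println("Starting a new backward sync session");
    session.run();
  }
}
```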
Steps to Reproduce (Bug)
- Initiate BWS after startup on a network like Goerli
- Besu is likely to fail, since BWS often hits its timeouts/retry limit before Besu can peer
- Besu eventually recovers
This will be addressed as part of the larger sync refactor.
I have come across this on some more nodes recently, except now it never seems to recover and requires a restart.
The logs are spammed with:
{"@timestamp":"2024-03-05T13:57:59,199","level":"INFO","thread":"EthScheduler-Timer-0","class":"BackwardSyncContext","message":"Current backward sync session failed, it will be restarted","throwable":""}
{"@timestamp":"2024-03-05T13:58:01,114","level":"INFO","thread":"vert.x-worker-thread-0","class":"BackwardSyncContext","message":"Starting a new backward sync session","throwable":""}
Occurred on the following nodes/versions/CLs:
- dev-elc-besu-lodestar-mainnet-simon-24.2.0-RC2-lodestar
- dev-elc-besu-nimbus-mainnet-simon-24.2.0-RC2-nimbus
- dev-elc-besu-prysm-holesky-dev-simon-RC4-prysm (24.2.0-RC4)
Potentially also related to https://github.com/hyperledger/besu/issues/6465
I found a different variant of this:
We receive an FcU containing the block hash of a head block. This block is added to the hashesToAppend queue. The block gets reorged, and when we try to retrieve it from our peers, none of them is able to provide it. This causes the backward sync to fail. When we receive the next FcU, a new hash may be added to the queue, but the new backward sync session again tries to retrieve the block that we have already unsuccessfully tried to retrieve before.
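To make the failure mode concrete, here is a minimal sketch of that queue behavior. Only the name hashesToAppend is taken from this issue; everything else is hypothetical, not Besu's actual code:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Illustrative only: head hashes received via FcU are queued and drained
// oldest-first. If the oldest hash was reorged away and no peer can serve
// that block, every session fails before the newer hashes are reached.
final class HashQueueSketch {
  final Deque<String> hashesToAppend = new ArrayDeque<>();

  void onForkChoiceUpdate(final String headHash) {
    hashesToAppend.add(headHash);
  }

  boolean runSession(final Predicate<String> peersHaveBlock) {
    final String oldest = hashesToAppend.peek();
    if (oldest == null) {
      return true; // nothing to do
    }
    if (!peersHaveBlock.test(oldest)) {
      // The session fails, but the stale hash stays at the front of the
      // queue, so the next session retries the same unobtainable block.
      return false;
    }
    hashesToAppend.poll();
    return true;
  }
}
```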
This happened on 7 out of 8 nodes I started based on 24.2.0-RC4: dev-elc-besu-teku-mainnet-dev-stefan-rc4-(1,2,4) and dev-elc-besu-teku-mainnet-dev-stefan-ss-(1,2,3,4).
The reorged block had the hash 0x4550b82492bf1738af79efb6140770c5443d368b9512ae8551583909554a040f.
The link to Kibana should work for about another 3 weeks: https://kibana.dev.protocols.consensys.net/app/r?l=DISCOVER_APP_LOCATOR&v=8.11.0&lz=N4IgjgrgpgTgniAXKSsGJANwLYH0B2AhtlIgDogAmUmAtFADYDGtARlAM4S0AuUA1t2yEAlvnxQetanQ58AZoXy0KAAiWVVJDh0IBzUhQAMADwAsAVgtHWADgBMZgJz3W8gIwB2AMy3C8zycoeVYANnczI09PIyYLMzNvSm9Q21YnC3d7QihbK3cLW28nIwz4wiNI%2BTUQABoQBiU9CH0oJBBBNBAAX3qOAHsYHiQAbRGQAAEeEW0eYgAHOqpOJhAAXTX6pn6GCGx8DlGsPCISJe1dA3X6sWoTdvsCs3t7eSdaIqLaS1DKWicnN4WEF5IlvEZKJ43GYlmI%2BDBMIQGO1CBAeP0lvIRAx4YdECNNlRCHMAGoiKAAdwAkpQHk8Xm8Pr5vN8LL9%2FoDgcEwRCoaCltMSAAlJptZAgeQwfrYdr4foU2jgygAelp9XRsvlPXqMGCuo4AAsqfh4YjkeKzdAkKEjLajPV5qiOGKeDBoN1ukA%3D%3D
Node rc4-1 has been restarted and finished syncing successfully.
Nice find @pinges, indeed we should give precedence to the latest head hash sent via FcU.
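A minimal sketch of that suggestion, reusing the hypothetical HashQueueSketch names from above: on each new FcU, drop the older queued hashes so the next session targets the latest head rather than a block that may have been reorged away.

```java
// Replacing onForkChoiceUpdate in the sketch above: a newer head
// supersedes everything queued before it, so a reorged ancestor can no
// longer wedge the sync. (A sketch, not necessarily how Besu will fix it.)
void onForkChoiceUpdate(final String headHash) {
  hashesToAppend.clear();
  hashesToAppend.add(headHash);
}
```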