
Bad logging / behavior during backwards sync

Open non-fungible-nelson opened this issue 2 years ago • 4 comments

Description

If Besu cannot find peers during backward sync, it indefinitely logs backwards sync failed, retrying... A clearer message stating that Besu is waiting for peers would be more useful than reporting that the sync failed.

Acceptance Criteria

  • Backwards sync logs after the first failure that it is waiting for peers
  • OR Besu waits for peers before beginning a BWS session
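The second acceptance criterion could look roughly like the following hypothetical sketch (the class and method names here are illustrative, not Besu's actual API): gate the start of a BWS session on a minimum peer count, and log a clear "waiting for peers" message instead of a failure.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: wait for a minimum peer count before starting a
// backward sync (BWS) session, rather than starting, failing, and retrying.
public class BackwardSyncGuard {
  private final IntSupplier peerCount; // assumed source of the current peer count
  private final int minPeers;

  public BackwardSyncGuard(final IntSupplier peerCount, final int minPeers) {
    this.peerCount = peerCount;
    this.minPeers = minPeers;
  }

  /**
   * Returns true once enough peers are connected; otherwise logs a clear
   * "waiting for peers" message instead of "sync failed".
   */
  public boolean readyToSync() {
    final int peers = peerCount.getAsInt();
    if (peers < minPeers) {
      System.out.println(
          "Waiting for peers before starting backward sync ("
              + peers + "/" + minPeers + " connected)");
      return false;
    }
    return true;
  }
}
```

With this guard, the session loop would simply poll `readyToSync()` before creating a BWS session, so the first log line after startup already explains the real cause (no peers) rather than a retry failure.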

Steps to Reproduce (Bug)

  1. Initiate BWS after startup on a network such as Goerli
  2. Besu is likely to fail, since BWS often hits its timeout/retry limit before Besu can peer
  3. Besu eventually recovers

non-fungible-nelson avatar Oct 05 '23 18:10 non-fungible-nelson

this will be addressed as part of the larger sync refactor

macfarla avatar Feb 08 '24 21:02 macfarla

I have come across this on some more nodes recently, except it never seems to recover and requires a restart.

The logs are spammed with:

{"@timestamp":"2024-03-05T13:57:59,199","level":"INFO","thread":"EthScheduler-Timer-0","class":"BackwardSyncContext","message":"Current backward sync session failed, it will be restarted","throwable":""}
{"@timestamp":"2024-03-05T13:58:01,114","level":"INFO","thread":"vert.x-worker-thread-0","class":"BackwardSyncContext","message":"Starting a new backward sync session","throwable":""}

Occurred on the following nodes/versions/CLs: dev-elc-besu-lodestar-mainnet-simon-24.2.0-RC2-lodestar dev-elc-besu-nimbus-mainnet-simon-24.2.0-RC2-nimbus dev-elc-besu-prysm-holesky-dev-simon-RC4-prysm (24.2.0-RC4)

Potentially also related to https://github.com/hyperledger/besu/issues/6465

siladu avatar Mar 05 '24 23:03 siladu

I found a different variant of this: we receive an FcU containing the block hash of a head block, and this block is added to the hashesToAppend queue. The block then gets reorged, and when we try to retrieve it from our peers, none of them can provide it. This causes the backward sync to fail. When we receive the next FcU, a new hash might be added to the queue, but the new backward sync session again starts by trying to retrieve the block that we have already failed to retrieve before.

This happened on 7 out of 8 nodes I started based on 24.2.0-RC4: dev-elc-besu-teku-mainnet-dev-stefan-rc4-(1,2,4) dev-elc-besu-teku-mainnet-dev-stefan-ss-(1,2,3,4)

The reorged block had the hash 0x4550b82492bf1738af79efb6140770c5443d368b9512ae8551583909554a040f.

Link to Kibana should work for about another 3 weeks: https://kibana.dev.protocols.consensys.net/app/r?l=DISCOVER_APP_LOCATOR&v=8.11.0&lz=N4IgjgrgpgTgniAXKSsGJANwLYH0B2AhtlIgDogAmUmAtFADYDGtARlAM4S0AuUA1t2yEAlvnxQetanQ58AZoXy0KAAiWVVJDh0IBzUhQAMADwAsAVgtHWADgBMZgJz3W8gIwB2AMy3C8zycoeVYANnczI09PIyYLMzNvSm9Q21YnC3d7QihbK3cLW28nIwz4wiNI%2BTUQABoQBiU9CH0oJBBBNBAAX3qOAHsYHiQAbRGQAAEeEW0eYgAHOqpOJhAAXTX6pn6GCGx8DlGsPCISJe1dA3X6sWoTdvsCs3t7eSdaIqLaS1DKWicnN4WEF5IlvEZKJ43GYlmI%2BDBMIQGO1CBAeP0lvIRAx4YdECNNlRCHMAGoiKAAdwAkpQHk8Xm8Pr5vN8LL9%2FoDgcEwRCoaCltMSAAlJptZAgeQwfrYdr4foU2jgygAelp9XRsvlPXqMGCuo4AAsqfh4YjkeKzdAkKEjLajPV5qiOGKeDBoN1ukA%3D%3D

Node rc4-1 has been restarted and finished syncing successfully.

pinges avatar Mar 12 '24 07:03 pinges

Nice find @pinges; indeed, we should give precedence to the latest head hash sent via FcU.
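As a rough illustration of that idea (all names here are hypothetical, not Besu's actual types): when a new head hash arrives via forkchoiceUpdated, older queued hashes are dropped first, so backward sync never keeps retrying a block that was reorged away.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: latest FcU head takes precedence over stale targets.
public class HeadHashQueue {
  private final Deque<String> hashesToAppend = new ArrayDeque<>();

  /**
   * On each forkchoiceUpdated, clear any stale targets (possibly reorged
   * blocks that no peer can serve) before enqueuing the new head hash.
   */
  public void onForkchoiceUpdated(final String headHash) {
    hashesToAppend.clear();
    hashesToAppend.add(headHash);
  }

  /** The next block hash the backward sync session should target. */
  public String nextTarget() {
    return hashesToAppend.peek();
  }
}
```

Under this sketch, the unretrievable hash from the reorg scenario above would be discarded as soon as the next FcU arrives, instead of blocking every subsequent BWS session.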

fab-10 avatar Mar 12 '24 13:03 fab-10