besu icon indicating copy to clipboard operation
besu copied to clipboard

ChainDownloader broken after threads contention

Open ahamlat opened this issue 3 years ago • 2 comments

Description

One of our users reported a case where Besu stuck for more than 2,5 hours when switching from blocks import step (FastImportBlocksStep) to full import step (FullImportBlockStep). Here're the logs shared by the user.

Aug 30 10:50:40 xxxxxx besu[65329]: 2022-08-30 10:50:40.874+00:00 | EthScheduler-Services-1022 (importBlock) | INFO  | FastImportBlocksStep | Completed importing chain segment 15438477 to 15438776 (300 blocks in 11488ms), Peers: 28
Aug 30 10:50:41 xxxxxx besu[65329]: 2022-08-30 10:50:41.469+00:00 | EthScheduler-Services-1022 (importBlock) | INFO  | PipelineChainDownloader | PipelineChain download complete
Aug 30 10:50:41 xxxxxx besu[65329]: 2022-08-30 10:50:41.797+00:00 | EthScheduler-Services-1022 (importBlock) | INFO  | DefaultSynchronizer | Sync completed successfully with pivot block 15438799
Aug 30 10:50:41 xxxxxx besu[65329]: 2022-08-30 10:50:41.801+00:00 | EthScheduler-Services-1022 (importBlock) | INFO  | BesuControllerBuilder | Initial sync done, unsubscribe finalized block hash supplier
Aug 30 10:50:41 xxxxxx besu[65329]: 2022-08-30 10:50:41.844+00:00 | EthScheduler-Services-1022 (importBlock) | INFO  | FullSyncDownloader | Starting full sync.

I asked for thread dumps to check if there is any thread contention, the user shared three different thread dumps, threads_1.txt, threads_2.txt and threads_3.txt, listed in chronological order. threads_1.txt threads_2.txt threads_3.txt

We can notice a thread contention in the first thread dump, 7 threads blocking on a lock held by another thread nioEventLoopGroup-3-7.

image

The thread that held the lock was executing this code in EthPeers.java

void reattemptPendingPeerRequests() {
    synchronized (this) {
      final Iterator<PendingPeerRequest> iterator = pendingRequests.iterator();
      while (iterator.hasNext()) {
        final PendingPeerRequest request = iterator.next();
        if (request.attemptExecution()) {
          pendingRequests.remove(request);
        }
      }
    }
  }

The worker threads were blocking when executing this code

public PendingPeerRequest executePeerRequest(
      final PeerRequest request, final long minimumBlockNumber, final Optional<EthPeer> peer) {
    final PendingPeerRequest pendingPeerRequest =
        new PendingPeerRequest(this, request, minimumBlockNumber, peer);
    synchronized (this) {
      if (!pendingPeerRequest.attemptExecution()) {
        pendingRequests.add(pendingPeerRequest);
      }
    }
    return pendingPeerRequest;
  }

Once the lock was released (see threads_2.txt and threads_3.txt), the chaindownloader seems to be broken, all the worker threads are just waiting.

Thread contention may have no link with the chaindownloader issue if locking on this was intended and wanted to protect pendingRequests collection.

Acceptance Criteria

Besu doesn't stuck after block import step is finished.

Steps to Reproduce (Bug)

This is very difficult to reproduce.

Expected behavior: [What you expect to happen] Full Sync import step should start just after block import step was finished.

Actual behavior: [What actually happens] Besu stuck for more than 2,5 hours after switching from block import step to full import step.

Frequency: [What percentage of the time does it occur?] rare.

Versions (Add all that apply)

  • Software version: besu/v22.7.1/linux-x86_64/openjdk-java-17
  • Java version: openjdk 17.0.4 2022-07-19
  • OS Name & Version: Ubuntu 22.04.1 LTS

Additional Information (Add any of the following or anything else that may be relevant)

  • System info : Intel i5 (i5-9500 @ 3.00GHz, 6 core), 16GB, with SSD NVMe

ahamlat avatar Aug 31 '22 09:08 ahamlat

I think I may have experienced this as well. Besu seemed to report not syncing on the eth_syncing json api call, so I thought it was done. Lighthouse was reporting an error with updating the deposit contract cache.

Based on my grafana graph, it looks like it had been hung there (based on cpu usage) for 2 days.

I restarted besu and it started doing FullImportBlockStep right away and was over 20k blocks behind and is still syncing now.

terahertz5k avatar Sep 01 '22 20:09 terahertz5k

Interesting, could you share the logs, before, during and after the 2 days ?

ahamlat avatar Sep 02 '22 11:09 ahamlat

We have some users experiencing blocked threads on sync on 22.10.4 and latest as of writing. Worth revisiting?

non-fungible-nelson avatar Jan 26 '23 12:01 non-fungible-nelson