
"Failed to send scheduled attestation" errors—how to remedy or at least reduce?

Open JamesCropcho opened this issue 3 years ago • 1 comment

Description

The log of lighthouse beacon_node has large clumps of entries like:

21:06:33.945 ERRO Failed to send scheduled attestation
21:06:33.945 ERRO Failed to send scheduled attestation
21:06:33.946 ERRO Failed to send scheduled attestation

[…and so on and so on]

Everything is working perfectly (e.g. no recent restarts), save for the occasional Previous epoch attestation(s) failed to match head and Previous epoch attestation(s) had sub-optimal inclusion delay warnings; then, all at once, I get more than 100 of those ERRO log entries.

These clumps appear perhaps once every 8 hours on a beacon node whose validator client has ~100 validators. Notable configuration includes:

--validator-monitor-auto
--http-disable-legacy-spec
--block-cache-size 15
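
For context, those ride on the ordinary beacon node invocation, roughly like this (network, data-dir and other flags omitted; this is a sketch, not the exact command line used here):

lighthouse beacon_node \
  --validator-monitor-auto \
  --http-disable-legacy-spec \
  --block-cache-size 15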

At the time of the most recent occurrence (note peers):

21:06:29.001 INFO Synced  slot: 4394730, block:    …  empty, epoch: 137335, finalized_epoch: 137333, finalized_root: 0x585c…60d2, exec_hash: n/a, peers: 85, service: slot_notifier

UPDATE:

I am now also getting all-at-once gobs of these sorts of errors:

ERRO Unable to send message to the beacon processor, type: gossip_attestation, error: no available capacity, service: router

They were preceded by one of these entries:

ERRO Attestation queue full                  queue_len: 16384, msg: the system has insufficient resources for load

Also have gobs of these two:

ERRO Unable to send message to the beacon processor, type: gossip_aggregate, error: no available capacity, service: router
ERRO slog-async: logger dropped messages due to channel overflow, count: 11, service: router

…and then:

ERRO Attestation delay queue is full         msg: check system clock, queue_size: 16384

…followed by another slew of:

ERRO Failed to send scheduled attestation

According to htop, I am using just a fraction of available RAM, and instantaneous average CPU use (8 cores) spikes to 100% for about 3 seconds roughly every 15 seconds, then drops back to about 10%.

● ntp.service - Network Time Service
     Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2022-07-22 16:41:38 EDT; 1 weeks 5 days ago
       Docs: man:ntpd(8)
   Main PID: 750 (ntpd)
      Tasks: 2 (limit: 18486)
     Memory: 1.8M
     CGroup: /system.slice/ntp.service
             └─710 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 113:121

Warning: journal has been rotated since unit was started, output may be incomplete.
~$
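
For what it's worth, a quick way to confirm the daemon is actually keeping the clock in sync (not merely running), assuming the stock ntp/systemd tooling is available:

ntpq -p        # the peer currently selected for sync is marked with '*'; offset and jitter are in milliseconds
timedatectl    # should report "System clock synchronized: yes"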

END OF UPDATE; original ticket continues below.

Version

https://github.com/sigp/lighthouse/releases/download/v2.5.1/lighthouse-v2.5.1-aarch64-unknown-linux-gnu.tar.gz

Steps to resolve

If anyone is able to tell me whether any of the tactics below (or others) might reduce the frequency of these ERRO events, and hence resolve the issue, or, just as valuably, whether any or all of them are likely to have no helpful effect, I would appreciate it (tactics 1-3 are sketched as command-line flags after the list):

  1. Enabling --subscribe-all-subnets
  2. Increase/decrease target-peers
  3. Increase/decrease block-cache-size
  4. Increase the IOPS (I/O ops per second) available to the SSD storage of the data-dir
  5. Increase the throughput (MB per second) available to the SSD storage of the data-dir
  6. Increase the number of CPU/vCPU cores of the cloud instance running lighthouse beacon_node
  7. Increase the allotted network performance capacity of the cloud instance running lighthouse beacon_node
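
In case it helps anyone answering, tactics 1-3 would translate into beacon node flags roughly like this (values are placeholders rather than recommendations; flag names assumed current for v2.5.x):

# tactic 1: subscribe to all attestation subnets
# tactic 2: raise or lower the peer target (placeholder value shown)
# tactic 3: raise or lower the block cache (15 is the value already in use here)
lighthouse beacon_node \
  --subscribe-all-subnets \
  --target-peers 100 \
  --block-cache-size 15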

Thanks for reading.

JamesCropcho • Aug 03 '22 21:08

These logs generally indicate an issue of system resources. If it doesn't look like RAM is near its limit, it might be a limitation of the CPU or disk. If the VPS uses shared CPUs, that might be causing issues; it could be throttling you during the bursts you are seeing. So increasing the number of CPUs might solve it (although 8 should be plenty if you are just running lighthouse on this machine). If you have Grafana metrics set up, you could also get an idea of whether it might be a CPU vs disk bottleneck by checking out some dashboards.
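
One quick way to see which it is during one of those bursts (assuming the sysstat tools are installed, and that "beacon_node" matches the running process):

iostat -x 2                                 # per-device %util and await; values pinned high during the burst point at the disk
pidstat -u -p $(pgrep -of beacon_node) 2    # CPU use of the beacon node process; sustained saturation across cores points at the CPU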

realbigsean • Sep 07 '22 14:09