"Failed to send scheduled attestation" errors—how to remedy or at least reduce?
Description
The log of lighthouse beacon_node has large clumps of entries of:
21:06:33.945 ERRO Failed to send scheduled attestation
21:06:33.945 ERRO Failed to send scheduled attestation
21:06:33.946 ERRO Failed to send scheduled attestation
[…and so on and so on]
Everything is working perfectly (e.g. no recent restarts) save for the occasional Previous epoch attestation(s) failed to match head and Previous epoch attestation(s) had sub-optimal inclusion delay warnings; then, all at once, I get more than 100 of those ERRO log entries.
These clumps appear perhaps once every 8 hours on a beacon node whose validator client has ~100 validators. Notable configuration includes:
--validator-monitor-auto
--http-disable-legacy-spec
--block-cache-size 15
At the time of the most recent occurrence (note peers):
21:06:29.001 INFO Synced slot: 4394730, block: … empty, epoch: 137335, finalized_epoch: 137333, finalized_root: 0x585c…60d2, exec_hash: n/a, peers: 85, service: slot_notifier
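To put a number on how often these clumps occur, a rough script like the one below could scan the beacon node log for bursts of that ERRO line. This is a sketch under assumptions: the timestamp format and message text are taken from the excerpts above, and any two matching lines more than 60 seconds apart are treated as separate clumps.

```python
import re
from datetime import datetime, timedelta

# Matches lines like: "21:06:33.945 ERRO Failed to send scheduled attestation"
ERRO_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) ERRO Failed to send scheduled attestation")

def find_clumps(lines, gap=timedelta(seconds=60)):
    """Group matching ERRO lines into clumps separated by more than `gap`.

    Note: these log timestamps carry no date, so a clump that straddles
    midnight would be split; fine for a quick frequency estimate.
    """
    clumps = []
    last = None
    for line in lines:
        m = ERRO_RE.match(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if last is None or t - last > gap:
            clumps.append([t])    # start a new clump
        else:
            clumps[-1].append(t)  # extend the current clump
        last = t
    return clumps
```

Feeding it an open log file and printing `len(clump)` per clump would show whether the "~once every 8 hours" estimate holds and whether the clumps are growing.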
UPDATE:
I am now also getting all-at-once gobs of these sorts of errors:
ERRO Unable to send message to the beacon processor, type: gossip_attestation, error: no available capacity, service: router
They were preceded by one of these entries:
ERRO Attestation queue full queue_len: 16384, msg: the system has insufficient resources for load
Also have gobs of these two:
ERRO Unable to send message to the beacon processor, type: gossip_aggregate, error: no available capacity, service: router
ERRO slog-async: logger dropped messages due to channel overflow, count: 11, service: router
…and then:
ERRO Attestation delay queue is full msg: check system clock, queue_size: 16384
…followed by another slew of:
ERRO Failed to send scheduled attestation
According to htop I am using just a fraction of available RAM, and instantaneous average CPU use (8-core) spikes to 100% for about 3 seconds about every 15 seconds, then goes back to about 10%.
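Those spikes can be measured without htop by sampling /proc/stat (Linux-only; field layout per proc(5)). A minimal sketch, assuming the standard aggregate "cpu" line where idle and iowait jiffies count as idle time and everything else as busy:

```python
def cpu_busy_percent(prev_line, curr_line):
    """Busy CPU % between two snapshots of the aggregate 'cpu' line in /proc/stat.

    Fields after the 'cpu' label are cumulative jiffies:
    user nice system idle iowait irq softirq steal [guest guest_nice].
    """
    prev = [int(x) for x in prev_line.split()[1:]]
    curr = [int(x) for x in curr_line.split()[1:]]
    idle_delta = (curr[3] + curr[4]) - (prev[3] + prev[4])
    total_delta = sum(curr) - sum(prev)
    if total_delta == 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta
```

Calling this on `open('/proc/stat').readline()` snapshots taken one second apart in a loop would log exactly when the ~3-second bursts land, and whether they line up with the ERRO clumps.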
● ntp.service - Network Time Service
Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2022-07-22 16:41:38 EDT; 1 weeks 5 days ago
Docs: man:ntpd(8)
Main PID: 750 (ntpd)
Tasks: 2 (limit: 18486)
Memory: 1.8M
CGroup: /system.slice/ntp.service
└─710 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 113:121
Warning: journal has been rotated since unit was started, output may be incomplete.
END OF UPDATE; original ticket continues below.
Version
https://github.com/sigp/lighthouse/releases/download/v2.5.1/lighthouse-v2.5.1-aarch64-unknown-linux-gnu.tar.gz
Steps to resolve
If anyone can tell me whether any of the tactics below (or others) might reduce the frequency of these ERRO events and thereby resolve the issue, or (just as valuable) whether any or all of them are likely to have no helpful effect, I would appreciate it:
- Enabling --subscribe-all-subnets
- Increasing or decreasing target-peers
- Increasing or decreasing block-cache-size
- Increasing the IOPS (I/O operations per second) available to the SSD storage holding the data-dir
- Increasing the throughput (MB per second) available to the SSD storage holding the data-dir
- Increasing the number of CPU/vCPU cores of the cloud instance running lighthouse beacon_node
- Increasing the allotted network performance capacity of the cloud instance running lighthouse beacon_node
Thanks for reading.
These logs generally indicate a system-resource issue. If RAM doesn't look near its limit, it might be a limitation of the CPU or disk. If the VPS uses shared CPUs, that could be the cause: the provider may be throttling you during the bursts you are seeing, so increasing the number of CPUs might solve it (although 8 should be plenty if the machine only runs Lighthouse). If you have Grafana metrics set up, you could also get an idea of whether it is a CPU or a disk bottleneck by checking some dashboards.
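If the node exposes Prometheus metrics, queue pressure can also be checked directly instead of inferred from the log bursts. The sketch below filters a Prometheus-format scrape for beacon-processor queue metrics; the metric-name prefix `beacon_processor` and the default metrics port 5054 are assumptions based on a typical Lighthouse setup started with --metrics, so adjust both to match your deployment.

```python
import urllib.request

def queue_metrics(text, prefix="beacon_processor"):
    """Return {metric_name: value} for non-comment scrape lines starting with `prefix`."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.startswith(prefix):
            continue
        # Prometheus text format: "<name_and_labels> <value>"
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

def scrape(url="http://localhost:5054/metrics"):
    """Fetch the metrics endpoint and return the filtered queue metrics."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return queue_metrics(resp.read().decode())
```

Watching those values around the time of a clump would show whether the queues fill gradually (a throughput bottleneck) or all at once (a stall, e.g. CPU throttling or a disk pause).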