
Optimize against ValidatorPubkeyCache timeouts

Open paulhauner opened this issue 5 years ago • 8 comments

Description

A user in Discord recently posted the following log:

Apr 28 19:09:08.447 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon
Apr 28 19:09:08.447 WARN BlockProcessingFailure                  outcome: ValidatorPubkeyCacheLockTimeout, msg: unexpected condition in processing block.

This user was not the first; I've personally seen quite a few users report this same error.

The validator_pubkey_cache is protected by a TimeoutRwLock. This means there can only be lock contention when something holds the write lock.
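For context, the idea of a read lock that gives up after a deadline can be sketched with only the standard library. This is an illustrative approximation, not Lighthouse's actual `TimeoutRwLock` implementation; the type and method names below are hypothetical:

```rust
use std::sync::{RwLock, RwLockReadGuard};
use std::time::{Duration, Instant};

/// Hypothetical sketch of a read lock that gives up after a deadline,
/// similar in spirit to a TimeoutRwLock (names here are illustrative).
struct TimeoutRwLock<T>(RwLock<T>);

impl<T> TimeoutRwLock<T> {
    fn new(value: T) -> Self {
        Self(RwLock::new(value))
    }

    /// Spin until the read lock is acquired or the timeout elapses.
    fn try_read_timeout(&self, timeout: Duration) -> Option<RwLockReadGuard<'_, T>> {
        let deadline = Instant::now() + timeout;
        loop {
            if let Ok(guard) = self.0.try_read() {
                return Some(guard);
            }
            if Instant::now() >= deadline {
                // This is where an error like ValidatorPubkeyCacheLockTimeout
                // would surface: a writer held the lock past the deadline.
                return None;
            }
            std::thread::yield_now();
        }
    }
}

fn main() {
    let lock = TimeoutRwLock::new(42u32);
    // Uncontended: the read lock is acquired immediately.
    let guard = lock.try_read_timeout(Duration::from_millis(10)).unwrap();
    println!("{}", *guard);
}
```

Because readers of an `RwLock` do not block each other, a read can only time out while a writer holds the lock for longer than the deadline.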

I think (it's worth confirming) this is the only place we take a write-lock on the pubkey cache:

https://github.com/sigp/lighthouse/blob/3a24ca5f14c6e6d6612fd43eca82aa0c1e6aba16/beacon_node/beacon_chain/src/beacon_chain.rs#L1633-L1641

I think (also worth confirming) that the bulk of the time holding the write-lock is spent inside this function:

https://github.com/sigp/lighthouse/blob/3a24ca5f14c6e6d6612fd43eca82aa0c1e6aba16/beacon_node/beacon_chain/src/validator_pubkey_cache.rs#L124-L169

So, it stands to reason that if we can optimize out some time spent in ValidatorPubkeyCache::import then we can avoid ValidatorPubkeyCacheLockTimeout errors.

I think the first step should be to add metrics to ValidatorPubkeyCache::import and determine what's taking the longest.

Potential Optimization

I'm guessing that the write-to-disk part is taking the longest.

It's really important that we retain the current behaviour where the on-disk "backing" is updated somewhat-atomically with the internal Vecs and HashMap.

One potential option is to have a "prepare" step where we can copy the backing to a new random file and add the new keys to it. Then, we have a "commit" phase where we replace the old backing with the new (only a single file copy operation) and update the internal structures.

The "prepare" step only needs a read-lock, whilst the "commit" still needs a write-lock (but is hopefully much faster).

Once we determine it is indeed writing to disk that's slow we can expand upon this.
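The prepare/commit split described above could be sketched roughly as follows. This is a minimal in-memory illustration with hypothetical names; the real cache also maintains an on-disk backing, and the slow persistence work would happen during `prepare`:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Illustrative sketch of the proposed two-phase update. All names are
/// hypothetical; the real ValidatorPubkeyCache also persists to disk.
struct PubkeyCache {
    map: RwLock<HashMap<u64, [u8; 48]>>,
}

/// Work staged under a read lock. In the real proposal, this is where the
/// slow write-to-disk would happen, off the write-lock critical path.
struct Prepared {
    new_entries: Vec<(u64, [u8; 48])>,
}

impl PubkeyCache {
    fn prepare(&self, incoming: &[(u64, [u8; 48])]) -> Prepared {
        // Read lock only: other readers are not blocked while we do the
        // expensive part (e.g. persisting the new keys).
        let map = self.map.read().unwrap();
        let new_entries = incoming
            .iter()
            .filter(|(index, _)| !map.contains_key(index))
            .cloned()
            .collect();
        Prepared { new_entries }
    }

    fn commit(&self, prepared: Prepared) {
        // Write lock held only for the fast in-memory mutation.
        let mut map = self.map.write().unwrap();
        for (index, pubkey) in prepared.new_entries {
            map.insert(index, pubkey);
        }
    }
}

fn main() {
    let cache = PubkeyCache { map: RwLock::new(HashMap::new()) };
    let prepared = cache.prepare(&[(0, [1u8; 48])]);
    cache.commit(prepared);
    assert!(cache.map.read().unwrap().contains_key(&0));
}
```

The design goal is that the write lock is held only for the cheap in-memory swap, so concurrent readers rarely hit the timeout.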

paulhauner avatar Apr 29 '21 05:04 paulhauner

The default backing is the DB now (since https://github.com/sigp/lighthouse/pull/2234), so we can't use an intermediate file per se, although I agree we can probably do a prepare/commit step with a read lock initially during the DB write, and then a write lock during the mutation.

michaelsproul avatar Apr 29 '21 06:04 michaelsproul

The default backing is the DB now

Oh yeah, I forgot about this.

I agree we can probably do a prepare/commit step with a read lock initially during the DB write, and then a write lock during the mutation.

We just need to maintain some sort of mechanism to ensure we mutate the DB and in-memory cache in lock-step. That might involve moving the locks inside ValidatorPubkeyCache.

paulhauner avatar Apr 29 '21 06:04 paulhauner

It took me ages to find, but an upgrade-able lock might be useful for this: https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLockUpgradableReadGuard.html#method.upgrade

https://docs.rs/lock_api/0.4.2/lock_api/struct.RwLock.html#method.upgradable_read
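To illustrate why an upgradable lock helps: the standard library's `RwLock` has no atomic read-to-write upgrade, so the usual workaround is to drop the read guard, take the write lock, and re-check the state (another writer may have intervened). `lock_api`'s `RwLockUpgradableReadGuard::upgrade` removes the need for that second check. A std-only sketch of the workaround pattern, with illustrative names:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Insert a key only if missing, using std's RwLock. Because std offers no
/// atomic read->write upgrade, we must drop the read guard and re-check
/// under the write lock. Names and types here are illustrative.
fn insert_if_missing(map: &RwLock<HashMap<u64, String>>, key: u64, value: &str) -> bool {
    {
        let read = map.read().unwrap();
        if read.contains_key(&key) {
            return false;
        }
        // Read guard dropped here: another writer may sneak in.
    }
    let mut write = map.write().unwrap();
    // Re-check, since the state may have changed between the two locks.
    if write.contains_key(&key) {
        return false;
    }
    write.insert(key, value.to_string());
    true
}

fn main() {
    let map = RwLock::new(HashMap::new());
    assert!(insert_if_missing(&map, 1, "a"));
    assert!(!insert_if_missing(&map, 1, "b"));
    assert_eq!(map.read().unwrap().get(&1).map(String::as_str), Some("a"));
}
```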

michaelsproul avatar May 07 '21 06:05 michaelsproul

Hi Lighthouse team, we're currently experiencing issues related to this. Please see the following log:

Sep 09 21:41:21 cqt-operator-p1 systemd[1]: Started lighthouse eth2 service.
Sep 11 02:00:38 cqt-operator-p1 lighthouse[179796]: Sep 11 02:00:38.094 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon
Sep 11 02:25:18 cqt-operator-p1 lighthouse[179796]: Sep 11 02:25:18.922 CRIT Chain removed                           op: batch processing result, reason: WrongBatchState("Validation failed for batch in wrong state: AwaitingProcessing(16Uiu2HAm8HcQWjTmpEeqW96nDeYRs5c9bYV7SMPYSyeVGbEdPExU, 63 blocks)"), id: 12382284213649068281, from: 60234, to: 145933, end_root: 0x8093…fac2, current_target: 60238, batches: 7, peers: 4, validated_batches: 0, state: Syncing, sync_type: Finalized, service: sync
Sep 11 09:22:16 cqt-operator-p1 lighthouse[179796]: Sep 11 09:22:16.746 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon
Sep 11 10:44:41 cqt-operator-p1 lighthouse[179796]: Sep 11 10:44:41.898 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon
Sep 11 16:18:41 cqt-operator-p1 lighthouse[179796]: Sep 11 16:18:41.779 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon

The bn is running with the following config -

lighthouse \
--network mainnet \
beacon_node \
--datadir /scratch/lighthouse \
--debug-level crit \
--http \
--checkpoint-sync-url https://beaconstate.ethstaker.cc \
--execution-endpoint http://localhost:8551 \
--execution-jwt /scratch/node/eth-mainnet/geth/jwtsecret

To add, we've no idea of the status of the beacon node sync from the above log, but a quick size scan of the datadir gave me the following -

root@cqt-operator-p1:/scratch/lighthouse/beacon# sudo du -sh *
42G	chain_db
6.3G	freezer_db
590M	logs
16K	network

the running CL version -

lighthouse --version
Lighthouse v3.1.0-aa022f4
BLS library: blst
SHA256 hardware acceleration: true
Specs: mainnet (true), minimal (false), gnosis (false)

in conjunction with EL version -

Geth
Version: 1.10.23-stable
Git Commit: 5774d758f4f1e45c23ff6c81b957445b6bfcc5e8
Git Commit Date: 20220910
Architecture: amd64
Go Version: go1.18.1
Operating System: linux

OS -

Linux cqt-operator-p1 #15-Ubuntu SMP Fri Jun 10 11:30:24 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kindly advise on how to proceed!

noslav avatar Sep 11 '22 19:09 noslav

@noslav Try turning on the --disable-lock-timeouts flag. It will silence the error and prevent spurious retries, but it will not make the node process blocks any faster (the fundamental reason for the timeout). You might want to look into improving the CPU and/or I/O performance of your hardware as well.

This issue is about improving our handling of the cache, which will not be available until a future release.

michaelsproul avatar Sep 11 '22 22:09 michaelsproul

@noslav Try turning on the --disable-lock-timeouts flag. It will silence the error and prevent spurious retries but will not make the node process blocks any faster (the fundamental reason for the timeout). You might want to look into improving the CPU and or I/O performance of your hardware as well.

This issue is about improving our handling of the cache, which will not be available until a future release.

thank you for the quick turnaround!

noslav avatar Sep 12 '22 00:09 noslav

@michaelsproul your suggestion helped, thank you! However, I'm wondering why my logs don't show the line mentioned in the FAQ section of the merge migration page:

INFO Ready for the merge, current_difficulty: 10789363, terminal_total_difficulty: 10790000

The logs currently only show the following:

Sep 12 00:19:41 cqt-operator-p1 systemd[1]: Stopped lighthouse eth2 service.
Sep 12 00:19:41 cqt-operator-p1 systemd[1]: lighthouse.service: Consumed 12h 27min 57.266s CPU time.
Sep 12 00:19:41 cqt-operator-p1 systemd[1]: Started lighthouse eth2 service.

Has this anything to do with my EL configuration?

noslav avatar Sep 12 '22 20:09 noslav

@noslav Remove the --debug-level crit flag; it causes your node to log only critical errors, which is quite a strange configuration and probably not what you want!

michaelsproul avatar Sep 12 '22 21:09 michaelsproul

I've overhauled the pubkey cache on the tree-states branch by making it into a generic cache for all validator information. I might try a new approach there using a CoW map for the data so that we can clone it when reading, and then mutate a copy while writing to disk, which then gets swapped into place.
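The CoW idea above could be sketched along these lines. This is a minimal illustration with hypothetical names, using an `Arc` swap rather than any particular CoW library: readers take a cheap snapshot, and a writer mutates a private copy (e.g. while writing to disk) before swapping it into place under a brief write lock:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Sketch of the copy-on-write idea: readers cheaply clone an Arc to the
/// current map; a writer clones the map itself, mutates the copy while no
/// lock is held, then swaps it into place. All names are hypothetical.
struct CowCache {
    map: RwLock<Arc<HashMap<u64, [u8; 48]>>>,
}

impl CowCache {
    fn new() -> Self {
        Self { map: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Readers take a snapshot: cloning the Arc is O(1), so the read lock
    /// is held only momentarily.
    fn snapshot(&self) -> Arc<HashMap<u64, [u8; 48]>> {
        Arc::clone(&self.map.read().unwrap())
    }

    /// Writers mutate a private copy (this is where the slow disk write
    /// would happen), then swap it in under a brief write lock.
    fn import(&self, new_entries: &[(u64, [u8; 48])]) {
        let mut copy = (*self.snapshot()).clone();
        for (index, pubkey) in new_entries {
            copy.insert(*index, *pubkey);
        }
        *self.map.write().unwrap() = Arc::new(copy);
    }
}

fn main() {
    let cache = CowCache::new();
    let before = cache.snapshot();
    cache.import(&[(0, [1u8; 48])]);
    // Old snapshots are unaffected; new snapshots see the update.
    assert!(before.is_empty());
    assert!(cache.snapshot().contains_key(&0));
}
```

A trade-off of this approach is that `import` pays for a full clone of the map, in exchange for readers never blocking on the slow path.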

michaelsproul avatar Nov 17 '22 05:11 michaelsproul