Besu Bad Block Manager not retaining bad blocks

Description

As a developer, I need my bad block manager in Besu to maintain a list of the last 100 invalid blocks for review.

Acceptance Criteria

Bad block manager retains up to 100 invalid blocks
Queryable via debug_getBadBlocks
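
To make the retention criterion concrete, here is a minimal sketch of a cache that keeps only the 100 most recently added invalid blocks. It is an illustration only: the class and method names are hypothetical, the `LinkedHashMap` eviction approach is an assumption, and Besu's actual BadBlockManager may be implemented differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Besu's BadBlockManager: keeps the most recently
// added invalid blocks, keyed by block hash, up to a fixed limit.
class BoundedBadBlockCache {
  static final int MAX_BAD_BLOCKS = 100;

  // Insertion-ordered map; the eldest entry is dropped once the limit is exceeded.
  private final Map<String, Long> badBlocks =
      new LinkedHashMap<String, Long>() {
        @Override
        protected boolean removeEldestEntry(final Map.Entry<String, Long> eldest) {
          return size() > MAX_BAD_BLOCKS;
        }
      };

  // Records an invalid block by hash and block number.
  synchronized void addBadBlock(final String blockHash, final long blockNumber) {
    badBlocks.put(blockHash, blockNumber);
  }

  // Returns the retained hashes, oldest first, as a query method might expose them.
  synchronized List<String> getBadBlockHashes() {
    return new ArrayList<>(badBlocks.keySet());
  }
}
```

With this shape, adding a 101st block silently evicts the oldest one, so the list never grows past the limit.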

Steps to Reproduce (Bug)

Witness a bad block on the network
Query debug_getBadBlocks
List is empty, even if Besu has logged an invalidated block
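
For anyone reproducing this, the query step could look roughly like the sketch below. It assumes the HTTP JSON-RPC endpoint is enabled at `http://127.0.0.1:8545` with the DEBUG API exposed; the URL and the helper class name are assumptions, not part of the report.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for issuing a debug_getBadBlocks call over HTTP JSON-RPC.
class DebugGetBadBlocksQuery {
  // JSON-RPC 2.0 request body; the method is called with an empty params array.
  static final String REQUEST_BODY =
      "{\"jsonrpc\":\"2.0\",\"method\":\"debug_getBadBlocks\",\"params\":[],\"id\":1}";

  // Builds the POST request without sending it.
  static HttpRequest buildRequest(final String rpcUrl) {
    return HttpRequest.newBuilder(URI.create(rpcUrl))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(REQUEST_BODY))
        .build();
  }

  // Sends the request and returns the raw JSON response body.
  static String query(final String rpcUrl) throws IOException, InterruptedException {
    final HttpClient client = HttpClient.newHttpClient();
    return client.send(buildRequest(rpcUrl), HttpResponse.BodyHandlers.ofString()).body();
  }
}
```

Calling `DebugGetBadBlocksQuery.query("http://127.0.0.1:8545")` after an invalidated block has been logged is the step that returns an empty list in this report.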

Logs (if a bug)

2024-02-06 06:24:13.958-07:00 | vert.x-worker-thread-0 | INFO  | AbstractEngineNewPayload | Imported #19,169,472 / 166 tx / 16 ws / base fee 43.82 gwei / 14,602,546 (48.7%) gas / (0x27dcb717d5921af93e2b56a82f546e1c11b33619238510cbe36e4d8c43f446a8) in 0.406s. Peers: 25
2024-02-06 06:24:24.288-07:00 | vert.x-worker-thread-0 | WARN  | AbstractEngineNewPayload | Invalid new payload: number: 19169473, hash: 0x6868f8e474a163bb17f39d05847aa8e3f1b38db3cca0dfc9bd139db1da8dcde2, parentHash: 0x27dcb717d5921af93e2b56a82f546e1c11b33619238510cbe36e4d8c43f446a8, latestValidHash: null, status: INVALID, validationError: Computed block hash 0x86f374fd803515f37f0b4b0e24ae648c0e1c931f9932bd88ce5bbb56d0e0d267 does not match block hash parameter 0x6868f8e474a163bb17f39d05847aa8e3f1b38db3cca0dfc9bd139db1da8dcde2
2024-02-06 06:24:39.661-07:00 | vert.x-worker-thread-0 | INFO  | AbstractEngineNewPayload | Imported #19,169,473 / 284 tx / 16 ws / base fee 43.67 gwei / 21,770,359 (72.6%) gas / (0x951ec60fe8aa99f0535e3f850a70a87648bcb945181d7853009180b8ba62b0be) in 0.740s. Peers: 25
2024-02-06 06:24:49.172-07:00 | vert.x-worker-thread-0 | INFO  | AbstractEngineNewPayload | Imported #19,169,474 / 100 tx / 16 ws / base fee 46.13 gwei / 10,639,937 (35.5%) gas / (0x70a99b648a29b25605da3ab37be3203b711252152c8067388da945f8d5d8dc30) in 0.302s. Peers: 25
2024-02-06 06:25:01.359-07:00 | vert.x-worker-thread-0 | INFO  | AbstractEngineNewPayload | Imported #19,169,475 / 138 tx / 16 ws / base fee 44.46 gwei / 10,579,045 (35.3%) gas / (0xaa8c11db61f83694d1303e63d8d36a5f2acada23c580068bea3fe118bab3dc40) in 0.327s. Peers: 25

The RPC query issued after the above logs returns an empty list.

Versions (Add all that apply)

Software version: Latest

Feb 08 '24 18:02 non-fungible-nelson

One important thing to take into account: we don't persist bad blocks over restarts, and if we change this we might want to consider the impact it will have on the engine_api calls we receive. Restarts sometimes prevent Besu from halting on a bad block. It's not clear to me whether it's the CL or Besu that is stuck on the same bad block in these cases. If we decide to persist over restarts, we might consider adjusting the logic in the engine API to not halt on a bad block, or developing a way to clean up the BadBlockManager if we want to. Maybe a debug RPC method?

Feb 16 '24 01:02 gfukushima

@mbaxter any insight here RE your refactor?

Feb 28 '24 21:02 non-fungible-nelson

The refactor here was merged after this issue was created, so this doesn't look like a regression.

The particular log message in the description ("Computed block hash ... does not match") is generated from the JSON-RPC method engine_newPayloadVX. We are not pushing blocks into the BadBlockManager based on the high-level validation happening in this new payload method. We are currently only marking blocks bad when we try to actually import them locally (or we find descendants of known bad blocks).
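
As a rough illustration of the distinction above, the following sketch models the three paths: a request-parameter failure that is rejected without caching, a failed local import that marks the block bad, and a descendant of a known bad block. All class, method, and parameter names are hypothetical and this is not Besu's code; it only mirrors the behaviour described in this comment.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of when a block ends up in the bad block cache.
class BadBlockMarkingSketch {
  enum Status { VALID, INVALID }

  private final Set<String> badBlockHashes = new HashSet<>();

  // Models the handling of a new payload; executesCleanly stands in for a local import attempt.
  Status onNewPayload(
      final String claimedHash,
      final String computedHash,
      final String parentHash,
      final boolean executesCleanly) {
    // Request-parameter check: a hash mismatch is rejected but NOT cached as a bad block.
    if (!claimedHash.equals(computedHash)) {
      return Status.INVALID;
    }
    // Descendant of a known bad block: marked bad without attempting an import.
    if (badBlockHashes.contains(parentHash)) {
      badBlockHashes.add(claimedHash);
      return Status.INVALID;
    }
    // Local import attempt: a failure here marks the block bad.
    if (!executesCleanly) {
      badBlockHashes.add(claimedHash);
      return Status.INVALID;
    }
    return Status.VALID;
  }

  boolean isBad(final String blockHash) {
    return badBlockHashes.contains(blockHash);
  }

  int badBlockCount() {
    return badBlockHashes.size();
  }
}
```

Under this model, the payload from the logs in the description takes the first branch, which is why it would never show up in the cache.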

Not sure I have enough context to say whether bad blocks should be cached from this JSON-RPC method, but it looks like we are validating the request parameters, in which case it would make sense that we're not caching this...

Feb 29 '24 13:02 mbaxter

As for persisting bad blocks across restarts, I suspect that may cause more harm than good? In the current worst case, as besu operates now, on restart we'll have to potentially reprocess some bad blocks before they're back in our cache. However, if there's a bug in bad block caching such that we mark a good block bad and we persist across restarts, it will require some more complicated manual intervention to get besu moving again.

Feb 29 '24 13:02 mbaxter