witnet-rust

Missing blocks in storage

Open tmpolaczyk opened this issue 4 years ago • 3 comments

I noticed some errors about missing blocks in one node. It is not clear if that's a problem with witnet-rust, or a problem with rocksdb, so I think the best solution for now would be to assume that this can happen and perform automatic database integrity checks or something similar.

I implemented a JSON-RPC command to check for missing blocks, available in this branch. It can be run as:

echo '{"jsonrpc":"2.0","id":"1","method":"checkBlockChain","params":null}' | ./target/release/witnet -c witnet.toml node raw

The result is this list of (epoch, block_hash) pairs that are missing from that node:

[[681700,"c2674475ed7cb14f83d782ca6d40d27908b1393df779445e07253667e690983c"],[681701,"661fae380e01a03acd7332f0430d735c1b98eb615bb8ef00c58183b72317a173"],[681702,"1162fdfc976212e6571b848a59d4288496333bf5a4e87413093df937bd9e37e4"],[681703,"e004ef4925350f4bb4b916fc72bb14a3043cb6bc926a646465d9d180978fa9bf"],[681704,"dadabbbb1e9d99ed5c444cfceacc71791ab657931306feaac6f6e6e9edc7ef38"],[681705,"ae36350715c1f68df72377aa07590c3c4b0e8d78985881aba9b2658e1c9624d8"],[681706,"218f41843fcdc5f67e2e0e798fbc49c7293b0df51ac9dd05a3bf77324b17857c"],[681707,"07995e9b4857265fa0f6df95f60357f67ce7e5256f62a017b896992c69bdd6c4"],[681708,"7adcc6fd6df0fe2f205adf2d7cd44db63d253b0f2253b224c1496f4bdd690dd2"],[681709,"b995950291409753dd4c91184a5edbf703210284069d500d2d3401536bf3cb60"]]
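For illustration, a check of this kind can be sketched as below. This is a minimal sketch, not the code in the branch: it assumes the node keeps an index of (epoch, block_hash) pairs for its chain and that block storage can be queried by hash. The names `find_missing_blocks` and `chain_index`, and the `HashSet` standing in for the RocksDB storage, are hypothetical.

```rust
use std::collections::HashSet;

/// Returns every (epoch, block_hash) entry of the chain index whose block
/// is absent from storage. `stored` is a hypothetical stand-in for a lookup
/// in the real block storage.
fn find_missing_blocks(
    chain_index: &[(u32, String)],
    stored: &HashSet<String>,
) -> Vec<(u32, String)> {
    chain_index
        .iter()
        .filter(|(_, hash)| !stored.contains(hash))
        .cloned()
        .collect()
}

fn main() {
    // Toy data: the block for epoch 2 is in the index but not in storage.
    let chain_index = vec![
        (1, "aa".to_string()),
        (2, "bb".to_string()),
        (3, "cc".to_string()),
    ];
    let stored: HashSet<String> =
        ["aa".to_string(), "cc".to_string()].into_iter().collect();

    let missing = find_missing_blocks(&chain_index, &stored);
    println!("{:?}", missing);
}
```

The sketch only reports gaps; it does not attempt to repair them.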

Currently this kind of error can be solved by doing a rewind, which will process all the blocks until the first missing block, and then continue doing a normal (slow) synchronization. So the next step would be to implement some functionality to retrieve those missing blocks from other peers automatically, without requiring the user to run the rewind command.
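One possible first step toward automatic retrieval is to group the missing epochs into contiguous ranges, so that each gap could be requested from peers as a single batch. The sketch below is only an illustration under that assumption: `contiguous_ranges` is a hypothetical helper, not part of witnet-rust, and the peer request itself is left out.

```rust
/// Groups epochs into inclusive (start, end) ranges of consecutive values.
/// The input does not need to be sorted; duplicates are ignored.
fn contiguous_ranges(epochs: &[u32]) -> Vec<(u32, u32)> {
    let mut sorted = epochs.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut ranges: Vec<(u32, u32)> = Vec::new();
    for epoch in sorted {
        // Extend the last range if this epoch directly follows it.
        if let Some(last) = ranges.last_mut() {
            if last.1 + 1 == epoch {
                last.1 = epoch;
                continue;
            }
        }
        // Otherwise start a new range containing only this epoch.
        ranges.push((epoch, epoch));
    }
    ranges
}

fn main() {
    // The ten missing epochs reported above collapse into a single range.
    let missing: Vec<u32> = (681700..=681709).collect();
    println!("{:?}", contiguous_ranges(&missing));
}
```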

tmpolaczyk, Oct 25 '21 15:10

I was able to reproduce this issue the other day. It can happen while synchronizing a node. Steps to try to reproduce:

  • Start node and wait until it is synchronizing
  • While the node is processing a batch of blocks, press ctrl-c to stop the node

The node will not stop immediately because it needs to finish processing the batch of blocks. However, after processing the batch it keeps running for a second and then crashes with the message:

pthread lock: Invalid argument
Aborted (core dumped)

After that, the node is missing exactly 20 blocks from this batch.

Probably related to #2008

tmpolaczyk, Mar 25 '22 16:03

@tmpolaczyk was this issue closed in #2159?

Tommytrg, Jun 27 '22 09:06

No, this is still not fixed.

tmpolaczyk, Jul 04 '22 07:07