[Windows] Ungraceful shutdown of node corrupts the database
Describe the bug Running the node on Windows and closing by clicking on "X" in the top right corner of the windows app has a high probability of corrupting the database.
To Reproduce
- Download the Window grin-node release
- Remove the .grin folder if you have it
- Open grin node and let it start sync
- Close it after 20 seconds or so by clicking 'X'
It doesn't always corrupt the data, but every 3rd try it usually ends in a bad state from which it can't recover.
Relevant Information Logs when trying to run the node after a corruption occured.
20210109 18:59:05.601 INFO grin_util::logger - log4rs is initialized, file level: Info, stdout level: Warn, min. level: Info
20210109 18:59:05.603 INFO grin - Using configuration file at C:\Users\phyro\.grin\main\grin-server.toml
20210109 18:59:05.603 INFO grin - This is Grin version 5.0.1 (git v5.0.1), built for x86_64-pc-windows-msvc by rustc 1.48.0 (7eac88abb 2020-11-16).
20210109 18:59:05.603 INFO grin - Chain: Mainnet
20210109 18:59:05.603 INFO grin - Accept Fee Base: 500000
20210109 18:59:05.603 INFO grin - Future Time Limit: 300
20210109 18:59:05.603 INFO grin - Feature: NRD kernel enabled: false
20210109 18:59:05.603 WARN grin::cmd::server - Starting GRIN in UI mode...
20210109 18:59:05.603 INFO grin_servers::grin::server - Starting server, genesis block: 40adad0aec27
20210109 18:59:05.779 ERROR grin_util::logger -
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Chain(Error { inner: NotFoundErr("BLOCK HEADER: 000007f722ca")
Store Error: NotFoundErr("BLOCK HEADER: 000007f722ca"), reason: DB Not Found Error: BLOCK HEADER: 000007f722ca })': src/bin\cmd\server.rs:58 0: <unknown>
1: <unknown>
2: <unknown>
3: <unknown>
4: <unknown>
5: <unknown>
6: <unknown>
7: <unknown>
8: <unknown>
9: <unknown>
10: <unknown>
11: <unknown>
12: <unknown>
13: <unknown>
14: <unknown>
15: <unknown>
16: <unknown>
17: <unknown>
18: BaseThreadInitThunk
19: RtlUserThreadStart
Screenshots /
Desktop (please complete the following information): I've noticed this on Windows.
Additional context
It works if you quit the "Grin way" by pressing 'q' and waiting for the cleanup prior to shutdown.
Here's probably some context around the problem.
https://stackoverflow.com/questions/26658707/windows-console-application-signal-for-closing-event
I believe the ctrlc package we use may not support the SIGBREAK signal. Perhaps using a different library to catch these signals and then reacting on it also in this case would solve the issue, but I didn't dive too deep.
This could definitely use a CI test that exposes the problem; that might be a good place to start.
I think we may have actually tracked down the NotFoundErr("BLOCK HEADER: error when we were frantically implementing the "rewind bad block" logic for #3605.
Or at least one situation where it can surface. Its related to the "sync MMR" and I believe fixed here -
https://github.com/mimblewimble/grin/pull/3603/files#diff-29a4dbf8cadda7c8ed1b4e35a237dfebe170b3703d7af2c78c065ae65ab04100
tl;dr the sync MMR gets "out of sync" relative to the header MMR and ends up in a state where it continues to refer to a header that no longer exists.
I'm not sure this actually resolves the issue identified in this PR. Its also not technically data corruption as we can recover from it and its more a case of us just handling it badly on startup.
To be clear we definitely do still have situations where it is possible to legitimately corrupt the data on disk, but this missing header is (I believe) a less severe problem that can be handled more robustly during node startup (and I think we do that now with the fix linked above).
- Remove the .grin folder if you have it
- Open grin node and let it start sync
- Close it after 20 seconds or so by clicking 'X'
So yes this is happening during "header sync" so in this specific case we are not deleting a header for it go go missing, but we are shutting the node down during the period where it is writing the sync MMR files to disk, but has not yet committed the batch of headers to the db. So on next startup the sync MMR is out beyond the headers in the db and the "sync head" points to a non-existent header. We want to reset the sync MMR to an earlier state which I believe we now correctly do in this situation, we just need to get past the initial startup where it makes an assumption that the header exists.