segmentation fault : rocksdb : Too many open files
Summary
- After my DB got destroyed by heavy load again, i tried to start with a snapshot as always. But without success. Every time got the same error (see below)
Steps to Reproduce
- just to have a clean directory (the behavior is the same with the old fully configured .defi folder)
- mv .defi .defi_backup
- mkdir -p .defi/snapshot
- cd .defi/snapshot
- wget https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4417683.zip (i tried the last 3 so far)
- unzip
- mv indexes evm blocks history burn anchors chainstate enhancedcs ../
- start defid
Environment
- Node Version: 4.2.0
- Block height on bug if applicable:
- TX or TX type on bug if applicable:
- OS with version: archlinux with 6.11.3-arch1-1 on btrfs
RUST_BACKTRACE=full
2024-10-16T18:59:04Z [ain_grpc] Starting JSON Eth RPC server at 127.0.0.1:8551
2024-10-16T18:59:04Z [ain_grpc] Init rs ws services
2024-10-16T18:59:04Z [ain_grpc] Starting ws server at 127.0.0.1:8553
2024-10-16T18:59:04Z init message: Done loading
2024-10-16T18:59:04Z 0 addresses found from DNS seeds
2024-10-16T18:59:04Z dnsseed thread exit
2024-10-16T18:59:04Z New outbound peer connected: version: 70038, blocks=4426259, peer=0
2024-10-16T18:59:05Z New outbound peer connected: version: 70038, blocks=4426259, peer=1
2024-10-16T18:59:05Z Synchronizing blockheaders, height: 4419682 (~99.85%)
2024-10-16T18:59:05Z New outbound peer connected: version: 70038, blocks=4426259, peer=2
2024-10-16T18:59:06Z Synchronizing blockheaders, height: 4421682 (~99.90%)
2024-10-16T18:59:06Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
thread '<unnamed>' panicked at /github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs:31:67:
called `Result::unwrap()` on an `Err` value: SimpleError { msg: SimpleMsg { err: "", file: "/github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs", line: 436, column: 57 }, cause: Some(SimpleError { msg: SimpleMsg { err: "IO error: While open a file for random read: /home/smo/.defi/evm/.vsdb/002338.sst: Too many open files", file: "/github/home/.cargo/git/checkouts/vsdb-80d7017a79ae04c0/020aaf6/core/src/common/engines/rocks_db.rs", line: 436, column: 57 }, cause: None }) }
stack backtrace:
0: 0x57906bc108af - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
1: 0x57906bc108af - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x57906bc108af - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
3: 0x57906bc108af - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
4: 0x57906b9e20bb - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
5: 0x57906b9e20bb - core::fmt::write::hc090a2ffd6b28c4a
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
6: 0x57906bbe3b52 - std::io::Write::write_fmt::h8898bac6ff039a23
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
7: 0x57906bc12399 - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
8: 0x57906bc12399 - std::sys_common::backtrace::print::ha96650907276675e
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
9: 0x57906bc11bbe - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:271:22
10: 0x57906bc12e59 - std::panicking::default_hook::h207342be97478370
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
11: 0x57906bc12e59 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
12: 0x57906bc12702 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
13: 0x57906bc12659 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
14: 0x57906bc12646 - rust_begin_unwind
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
15: 0x57906ab74b72 - core::panicking::panic_fmt::hdc63834ffaaefae5
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
16: 0x57906ab74f85 - core::result::unwrap_failed::h82b551e0ff2b2176
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5
17: 0x57906bc5c398 - core::ops::function::FnOnce::call_once::h9be83c72fcb34d3b
18: 0x57906bc634e8 - once_cell::imp::OnceCell<T>::initialize::{{closure}}::h16f939f811f483c4
19: 0x57906bb29f3f - once_cell::imp::initialize_or_wait::h63f27d139639c331
20: 0x57906ab9287b - once_cell::imp::OnceCell<T>::initialize::h15637d85abd0c950
21: 0x57906bc66130 - core::ops::function::FnOnce::call_once::h3b47c4b74e2b2ad2
22: 0x57906bc638db - once_cell::imp::OnceCell<T>::initialize::{{closure}}::ha9f41780e052a633
23: 0x57906bb29f3f - once_cell::imp::initialize_or_wait::h63f27d139639c331
24: 0x57906ab976ab - once_cell::imp::OnceCell<T>::initialize::h8e0d2283277274f6
25: 0x57906bc8fe24 - vsdb_trie_db::MptStore::trie_restore::h3ca9b979e2e376ac
26: 0x57906b50e865 - ain_evm::backend::EVMBackend::from_root::h171eaddb7be2fd68
27: 0x57906b41dbd5 - cxxbridge1$evm_try_unsafe_create_template
28: 0x57906b259f20 - _Z30evm_try_unsafe_create_templateR19CrossBoundaryResultmN4rust10cxxbridge13StrEjmm
29: 0x57906ac81036 - _ZN15CScopedTemplate6CreateEmNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjmm
30: 0x57906b1e8ec4 - _ZN11CChainState12ConnectBlockERK6CBlockR16CValidationStateP11CBlockIndexR15CCoinsViewCacheR13CCustomCSViewRK12CChainParamsRbb
31: 0x57906b1f81cb - _ZN11CChainState10ConnectTipER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockER12ConnectTraceR29DisconnectedBlockTransactions
32: 0x57906b1f9b99 - _ZN11CChainState21ActivateBestChainStepER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockERbR12ConnectTrace
33: 0x57906b1fc4d1 - _ZN11CChainState17ActivateBestChainER16CValidationStateRK12CChainParamsSt10shared_ptrIK6CBlockE
34: 0x57906b2068ea - _Z15ProcessNewBlockRK12CChainParamsSt10shared_ptrIK6CBlockEbPb
35: 0x57906b03eafd - _ZL14ProcessMessageP5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamlRK12CChainParamsP8CConnmanRKSt6atomicIbE
36: 0x57906b033bdc - _ZN19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
37: 0x57906b042397 - _ZThn8_N19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
38: 0x57906b0198a9 - _ZN8CConnman20ThreadMessageHandlerEv
39: 0x57906b029c3a - _ZNSt17_Function_handlerIFvvESt5_BindIFM8CConnmanFvvEPS2_EEE9_M_invokeERKSt9_Any_data
40: 0x57906b001136 - _Z11TraceThreadISt8functionIFvvEEEvPKcT_
41: 0x57906b029dfc - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPKcSt8functionIFvvEEES4_S7_EEEEE6_M_runEv
42: 0x57906c543194 - execute_native_thread_routine
43: 0x78b6928db39d - <unknown>
44: 0x78b69296049c - <unknown>
45: 0x0 - <unknown>
thread '<unnamed>' panicked at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cxx-1.0.128/src/unwind.rs:37:9:
panic in ffi function ain_rs_exports::ffi::evm_try_unsafe_create_template, aborting.
stack backtrace:
0: 0x57906bc108af - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
1: 0x57906bc108af - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x57906bc108af - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
3: 0x57906bc108af - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
4: 0x57906b9e20bb - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
5: 0x57906b9e20bb - core::fmt::write::hc090a2ffd6b28c4a
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
6: 0x57906bbe3b52 - std::io::Write::write_fmt::h8898bac6ff039a23
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
7: 0x57906bc12399 - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
8: 0x57906bc12399 - std::sys_common::backtrace::print::ha96650907276675e
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
9: 0x57906bc11bbe - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:271:22
10: 0x57906bc12e59 - std::panicking::default_hook::h207342be97478370
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
11: 0x57906bc12e59 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
12: 0x57906bc12702 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
13: 0x57906bc12659 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
14: 0x57906bc12646 - rust_begin_unwind
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
15: 0x57906ab74b72 - core::panicking::panic_fmt::hdc63834ffaaefae5
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
16: 0x57906ab75be7 - <cxx::unwind::Guard as core::ops::drop::Drop>::drop::h3c636553f50760c4
17: 0x57906b41e635 - cxxbridge1$evm_try_unsafe_create_template
18: 0x57906b259f20 - _Z30evm_try_unsafe_create_templateR19CrossBoundaryResultmN4rust10cxxbridge13StrEjmm
19: 0x57906ac81036 - _ZN15CScopedTemplate6CreateEmNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEjmm
20: 0x57906b1e8ec4 - _ZN11CChainState12ConnectBlockERK6CBlockR16CValidationStateP11CBlockIndexR15CCoinsViewCacheR13CCustomCSViewRK12CChainParamsRbb
21: 0x57906b1f81cb - _ZN11CChainState10ConnectTipER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockER12ConnectTraceR29DisconnectedBlockTransactions
22: 0x57906b1f9b99 - _ZN11CChainState21ActivateBestChainStepER16CValidationStateRK12CChainParamsP11CBlockIndexRKSt10shared_ptrIK6CBlockERbR12ConnectTrace
23: 0x57906b1fc4d1 - _ZN11CChainState17ActivateBestChainER16CValidationStateRK12CChainParamsSt10shared_ptrIK6CBlockE
24: 0x57906b2068ea - _Z15ProcessNewBlockRK12CChainParamsSt10shared_ptrIK6CBlockEbPb
25: 0x57906b03eafd - _ZL14ProcessMessageP5CNodeRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEER11CDataStreamlRK12CChainParamsP8CConnmanRKSt6atomicIbE
26: 0x57906b033bdc - _ZN19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
27: 0x57906b042397 - _ZThn8_N19PeerLogicValidation15ProcessMessagesEP5CNodeRSt6atomicIbE
28: 0x57906b0198a9 - _ZN8CConnman20ThreadMessageHandlerEv
29: 0x57906b029c3a - _ZNSt17_Function_handlerIFvvESt5_BindIFM8CConnmanFvvEPS2_EEE9_M_invokeERKSt9_Any_data
30: 0x57906b001136 - _Z11TraceThreadISt8functionIFvvEEEvPKcT_
31: 0x57906b029dfc - _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPKcSt8functionIFvvEEES4_S7_EEEEE6_M_runEv
32: 0x57906c543194 - execute_native_thread_routine
33: 0x78b6928db39d - <unknown>
34: 0x78b69296049c - <unknown>
35: 0x0 - <unknown>
thread '<unnamed>' panicked at library/core/src/panicking.rs:228:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
it seems btw to work with https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4268284.zip
-
with this snapshot i have measured up to 2334 open descriptors
-
with https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4423468.zip
-
i have measured up to 2532 open descriptors
but this seems not to be the problem, cause the fd count already drops to around 1600 before the error occurs
- increasing the soft limit by a factor of 4 (ulimit -n 4096) we see an increased discriptor count of 2298 but than results in
2024-10-16T20:52:58Z [vsdb_core::common::engines::rocks_db] rocksdb write_buffer_size per column family = 512MB
2024-10-16T20:52:58Z UpdateTip: height=4423469 hash=e41b0298011b8e76244eebe13db5bdd6e6c3a23bfe561fb8112114f98174698e date='2024-10-15T20:00:42Z' tx=34709360 log2_work=88.520461 progress=0.999525
*** buffer overflow detected ***: terminated
*** buffer overflow detected ***: terminated
script to measure open file descriptors
ps axco pid,comm | grep defid | xargs -n 2 -- bash -c 'echo -n "$1 $2 ";lsof -p $1 2>/dev/null | wc -l' argv0
(Please note that lsof | wc -l sums up a lot of duplicated entries (forked processes can share file handles etc).)
Related: https://github.com/facebook/rocksdb/issues/10367
Thanks for the detailed info @skutcher. We're looking at options to workaround this.
For now, please push the kernel limits and ulimits to higher. We've observed anything 16k+ usually works well. We'll look at a more refined fix soon.
yfyi: my default system limit:
limit -Hn
524288
ulimit -Sn
1024
but as mentioned above if i increase the softlimit i got just buffer overflows
but as mentioned above if i increase the softlimit i got just buffer overflows
I assume this after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off I'm afraid. Even if it doesn't error with something else, the integrity checks will fail to ensure the node doesn't start with bad data.
This is expected behavior. If you have issues with a clean data / snapshot, please let us know. We do run our machines with 1024 * 1024 or 16k as the file open limits depending on the system.
We do intend to workaround the rocksdb bug, but however, meanwhile the above should be a viable workaround. I've also expanded a bit more on https://github.com/DeFiCh/ain/issues/3101#issuecomment-2436546577
but as mentioned above if i increase the softlimit i got just buffer overflows
I assume this after it already failed and corrupted the data due to open files? Once the data is corrupt, all bets are off I'm afraid. Even if it doesn't error with something else, the integrity checks will fail to ensure the node doesn't start with bad data.
I am pretty sure it was with an fresh snapshot of https://defi-snapshots-europe.s3.eu-central-1.amazonaws.com/snapshot-mainnet-4423468.zip . I will check again to be 100% sure
We were able to test with stricts ulimit < 1024, and now the default should work well with the default of 1024 even though it's recommended to have higher limits.
Believe this issue should be resolved with the latest master.
Closing, as this is now resolved. Please feel free to reopen if you continue to have issues with >= 4.2.1.