Forcing rollback after Out-Of-Memory kill causes panic with nil pointer dereference
We have already hit very high memory consumption (128 GB) and OOM kills on the backend a couple of times. The rollback after the kill then fails and causes a panic exit on a nil pointer; see the logs below.
How can we repair /db, given that a full resync takes 4-5 days? Any help is appreciated.
Is garbage collection being done properly on the Go side? Any advice on environment settings for tuning it?
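For reference, our understanding is that the Go runtime's collector is mainly tuned via the GOGC and GOMEMLIMIT environment variables (or their runtime/debug equivalents). A minimal sketch of what we have been experimenting with on our side; the values are purely illustrative, not a recommendation for this codebase:

```go
package main

import "runtime/debug"

func init() {
	// Equivalent to GOGC=50: run the collector more aggressively than the
	// default of 100, trading extra CPU for a smaller live heap.
	debug.SetGCPercent(50)

	// Equivalent to GOMEMLIMIT=100GiB (Go 1.19+): a soft ceiling that makes the
	// runtime collect harder as the limit is approached, instead of letting the
	// container run straight into the kernel OOM killer.
	debug.SetMemoryLimit(100 << 30) // 100 GiB, illustrative only
}
```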
```
badger 2025/03/11 12:15:37 INFO: Block cache metrics: hit: 6 miss: 859463 keys-added: 146367 keys-updated: 0 keys-evicted: 76523 cost-added: 569568571 cost-evicted: 301133431 sets-dropped: 0 sets-rejected: 709775 gets-dropped: 0 gets-kept: 857088 gets-total: 859469 hit-ratio: 0.00
I0311 12:17:25.728528 1 config.go:227] Logging to directory /db/v-00000
I0311 12:17:25.728671 1 config.go:228] Running node in MAINNET mode
I0311 12:17:25.728677 1 config.go:229] Data Directory: /db/v-00000
I0311 12:17:25.728684 1 config.go:240] PoS Validator: ON
I0311 12:17:25.728687 1 config.go:244] HyperSync: ON
I0311 12:17:25.728690 1 config.go:250] ForceChecksum: OFF - This could allow a peer to trick you into downloading bad hypersync state. Be sure you're connecting to a trustworthy sync peer.
I0311 12:17:25.728694 1 config.go:256] SnapshotBlockHeightPeriod: 600000
I0311 12:17:25.728701 1 config.go:260] ArchivalMode: ON
I0311 12:17:25.728704 1 config.go:263] SyncType: blocksync
I0311 12:17:25.728708 1 config.go:270] Connect IPs: [deso-seed-4.io:17000]
I0311 12:17:25.728712 1 config.go:293] Max Inbound Peers: 125
I0311 12:17:25.728716 1 config.go:294] Protocol listening on port 17000
I0311 12:17:25.728720 1 config.go:300] Rate Limit Feerate: 0
I0311 12:17:25.728723 1 config.go:301] Min Feerate: 1000
badger 2025/03/11 12:17:26 INFO: All 3320 tables opened in 386ms
badger 2025/03/11 12:17:26 INFO: Discard stats nextEmptySlot: 0
badger 2025/03/11 12:17:26 INFO: Set nextTxnTs to 672795186
badger 2025/03/11 12:17:26 INFO: Deleting empty file: /db/v-00000/badgerdb/000566.vlog
E0311 12:17:26.308269 1 snapshot.go:448] NewSnapshot: Node didn't shut down properly last time. Entering a recovery mode. The node will roll back to last snapshot epoch block height (13200000) and hash (e2f8cee35bcae255443c6c294af965240d8fb19e308c7ece7dcb465a61beb815), then restart.
I0311 12:22:59.231991 1 config.go:227] Logging to directory /db/v-00000
I0311 12:22:59.232119 1 config.go:228] Running node in MAINNET mode
I0311 12:22:59.232125 1 config.go:229] Data Directory: /db/v-00000
I0311 12:22:59.232132 1 config.go:240] PoS Validator: ON
I0311 12:22:59.232135 1 config.go:244] HyperSync: ON
I0311 12:22:59.232138 1 config.go:250] ForceChecksum: OFF - This could allow a peer to trick you into downloading bad hypersync state. Be sure you're connecting to a trustworthy sync peer.
I0311 12:22:59.232143 1 config.go:256] SnapshotBlockHeightPeriod: 600000
I0311 12:22:59.232148 1 config.go:260] ArchivalMode: ON
I0311 12:22:59.232152 1 config.go:263] SyncType: blocksync
I0311 12:22:59.232156 1 config.go:270] Connect IPs: [deso-seed-4.io:17000]
I0311 12:22:59.232161 1 config.go:293] Max Inbound Peers: 125
I0311 12:22:59.232164 1 config.go:294] Protocol listening on port 17000
I0311 12:22:59.232167 1 config.go:300] Rate Limit Feerate: 0
I0311 12:22:59.232171 1 config.go:301] Min Feerate: 1000
badger 2025/03/11 12:22:59 INFO: All 3320 tables opened in 387ms
badger 2025/03/11 12:22:59 INFO: Discard stats nextEmptySlot: 0
badger 2025/03/11 12:22:59 INFO: Set nextTxnTs to 672795186
badger 2025/03/11 12:22:59 INFO: Deleting empty file: /db/v-00000/badgerdb/000567.vlog
E0311 12:22:59.775777 1 snapshot.go:448] NewSnapshot: Node didn't shut down properly last time. Entering a recovery mode. The node will roll back to last snapshot epoch block height (13200000) and hash (e2f8cee35bcae255443c6c294af965240d8fb19e308c7ece7dcb465a61beb815), then restart.
badger 2025/03/11 12:23:59 INFO: Block cache metrics: hit: 5 miss: 878509 keys-added: 146436 keys-updated: 1 keys-evicted: 76591 cost-added: 569837756 cost-evicted: 301402693 sets-dropped: 0 sets-rejected: 728752 gets-dropped: 0 gets-kept: 875968 gets-total: 878515 hit-ratio: 0.00
E0311 12:25:22.716425 1 miner.go:410] BigintToHash: Bigint is nil
E0311 12:25:22.716444 1 miner.go:410] BigintToHash: Bigint is nil
I0311 12:25:22.716913 1 legacy_mempool.go:2609] Calling StartReadOnlyUtxoViewRegenerator...
E0311 12:25:22.722465 1 server.go:712] NewServer: Forcing a rollback to the last snapshot epoch because node was not closed properly last time
I0311 12:25:22.722496 1 node.go:381] Node.listenToNodeMessages: Stopping node
I0311 12:25:22.722517 1 node.go:317] Node is shutting down. This might take a minute. Please don't close the node now or else you might corrupt the state.
I0311 12:25:22.722525 1 node.go:321] Node.Stop: Stopping server...
I0311 12:25:22.722532 1 server.go:3188] Server.Stop: Gracefully shutting down Server
I0311 12:25:22.722537 1 server.go:3191] Server.Stop: Flushing logs for all peers
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1676827]

goroutine 53 [running]:
github.com/deso-protocol/core/lib.(*Server).Stop(0x0)
	/deso/src/core/lib/server.go:3192 +0x107
github.com/deso-protocol/core/cmd.(*Node).Stop(0xc000a33680)
	/deso/src/core/cmd/node.go:322 +0x15c
github.com/deso-protocol/core/cmd.(*Node).listenToNodeMessages(0xc000a33680, {0x0, 0x0, 0x0})
	/deso/src/core/cmd/node.go:382 +0x12c
created by github.com/deso-protocol/core/cmd.(*Node).Start
	/deso/src/core/cmd/node.go:79 +0x365
```
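Reading the trace, (*Server).Stop is invoked with a nil receiver (0x0), so the first field access inside Stop dereferences nil. Below is our own simplified reconstruction of that pattern, not the actual server.go code; the type fields and the guard are made up, just to illustrate the class of panic and one way to avoid it:

```go
package main

import "log"

// Simplified stand-ins for the real types; field names are hypothetical.
type Server struct{ running bool }

func (srv *Server) Stop() {
	// Hypothetical guard: bail out if the server was never constructed,
	// e.g. because startup aborted before NewServer finished.
	if srv == nil {
		log.Println("Stop called on nil Server, nothing to do")
		return
	}
	srv.running = false // dereferences srv; panics if srv is nil and unguarded
}

type Node struct{ Server *Server }

func (node *Node) Stop() {
	node.Server.Stop() // safe even when node.Server is nil, thanks to the guard
}

func main() {
	n := &Node{} // Server was never assigned, as after an aborted startup
	n.Stop()
}
```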
BTW, we have seen a reduction in memory usage by setting MAX_PEERS=50, which seems to have an effect on one node but not on all of them. Any recommendation?
@lazynina this is affecting at least three nodes
As Carry explains, we've been looking at tweaking max peers and timeouts, and also at restricting the peers (especially the initial connections) to optimise further.
Through experiments between operators we also identified a huge list of "peers" that are dead, yet still seem to be pulled into the pool of peer requests.
I think @carry2web also found that a large number of connections are left open rather than being closed down.
But we have limited information to determine what is causing these extreme memory spikes... I do know you mentioned in the PoS discussion that header storage can lead to memory issues.
Does nobody have any suggestions? Is the only option left to clean up the db and resync (4 days)? Can the db not be clipped back to (a couple of) days ago and resynced from there?
Thank you for your replies in TG, @lazynina: "So you should see “NewServer: Not forcing a rollback to the last snapshot epoch even though the node was not properly closed last time” if force checksum is set to false."
As can be seen from the log at the top, ForceChecksum is OFF, yet the rollback and panic still occur.
@carry2web - I think the immediate solution would be to resync a node. I'll continue to investigate what happened here. Do you have an idea of what version was running on this node?
We used :stable until you said to use :latest; no matter what version, the results are the same. Is there anything in the logs that reports the version?
@carry2web - unfortunately the solution here is to resync the node from scratch or recover from a snapshot if you have one taken.
We will continue to investigate the panic you hit.
@SafetyNet-deso - there is nothing in the logs about the version. You can check by hitting the node version endpoint, but that won't work here since the HTTP server isn't starting up.
Thanks, we are rebuilding. BTW: sorry for the comment from @SafetyNet-deso; that was me logged in on an alternative account without noticing at the time ;-)