KeepUp Node panics at switch block with load
RC1 Testing:
KeepUp Nodes shutting down at/near switch block.
Network: 5 + 2
Name: ajith-condor-small (8s)
Branch: feat-2.0
Commit: fe996ee6bf859afec1925f01baf60f900bae6a77
Config: {"config":{}, "chainspec":{"era_duration": "30min", "minimum_block_time": "8192 ms", "block_max_mint_count": 325, "block_max_auction_count": 50, "block_max_standard_count": 15}}
Genesis: feat-2.0 (fe996ee6bf859afec1925f01baf60f900bae6a77)
Test: Transfer and WASM Load Testing, 30k transfers, 2.6k WASM transfers
:red_circle: Issue: KeepUp Node panics at switch block with load
Observations:
- network running with 8s blocks
- KeepUp nodes shutting down at/near the switch block
- occurs when actively loading/executing transactions close to the switch block
- node recovers to KeepUp after the era change
- sidecar RPC service was impacted
- errors in log:
"message":"dropped without being responded to outside of shutdown","responder":"Responder<()>" "message":"request for channel closed, this may be a bug? check if a component is stuck from now on","err":"oneshot canceled","channel":""()""},"target":"casper_node::effect" "message":"fatal error via control announcement","file":"node/src/components/storage.rs","line":"230","msg":"storage error: duplicate entries for blocks for transaction deploy-hash(8da2..9081): block-hash(257f..c272) / block-hash(d939..9270)" Validating nodes in healthy state avg finalisation times of 12ms avg finalisation times on load 225ms no missed proposals were observed always in 'validate' state Grafana: link http://genesis.casperlabs.io/ajith-condor-small/casper-node-dumps/ajith-condor-small/01052024_0123/172.142.93.244_sidecar.log.tar.gz
Slack Observations by Bart - https://casper-labs-team.slack.com/archives/C01ULKL8G8J/p1715000501640239?thread_ts=1714527473.997799&cid=C01ULKL8G8J
@SaiProServ - keeping this in the triage loop; Karan is currently assigned to the fix after @fizyk20 investigated the issue. @thamps-casp
You can always omit these two lines from error dumps:
"message":"dropped without being responded to outside of shutdown","responder":"Responder<()>" "message":"request for channel closed, this may be a bug? check if a component is stuck from now on","err":"oneshot canceled","channel":""()""},"target":"casper_node::effect"
They are always emitted when a node shuts down unexpectedly and dumps its event queue. The third message, which follows that noise, gives the actual reason for the fatal shutdown and is thus the one we want to see.
Bart reproduced this and determined the cause: the order of operations around determining block vacancy differs slightly between Validate and KeepUp nodes because of lifecycle differences. Validate nodes produce blocks, execute them, store them, and then share them with KeepUp nodes; KeepUp nodes first store those blocks and then execute them.
The block vacancy calculation that drives dynamic gas pricing does not account for this small difference, effectively causing an off-by-one error (by one block, in this case) at runtime. Under light load the calculated gas price does not change, so the issue goes unnoticed. Under heavy load, however, the Validate set and the KeepUp set can reach different conclusions. If and when this occurs, the KeepUp nodes are hosed and require a db-utils correction to be able to rejoin the network.
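To make the failure mode concrete, here is a minimal, self-contained Rust toy model (not casper-node code; the window size, thresholds, and function names below are all hypothetical) of how evaluating block fullness one block earlier or later in the lifecycle can flip the gas price decision under heavy load:

```rust
// Toy model of the off-by-one: a block counts as "full" when it hits the
// transaction limit, and the gas price goes up when enough recent blocks
// are full. Validate nodes evaluate the window after executing-and-storing
// the newest block; KeepUp nodes evaluate it after storing but before
// executing it, so under load their windows can differ by one block.

const WINDOW: usize = 5;                // hypothetical look-back window
const BLOCK_CAPACITY: u32 = 15;         // hypothetical per-block transaction limit
const FULL_BLOCKS_TO_RAISE: usize = 3;  // hypothetical threshold to bump the gas price

#[derive(Clone, Copy)]
struct Block {
    tx_count: u32,
}

impl Block {
    fn is_full(self) -> bool {
        self.tx_count >= BLOCK_CAPACITY
    }
}

/// Count full blocks among the last `WINDOW` processed blocks and decide
/// whether the gas price should rise for the next era.
fn should_raise_gas_price(processed: &[Block]) -> bool {
    let full = processed.iter().rev().take(WINDOW).filter(|b| b.is_full()).count();
    full >= FULL_BLOCKS_TO_RAISE
}

fn main() {
    // Heavy load: the three most recent blocks are full.
    let history: Vec<Block> =
        (0..10).map(|i| Block { tx_count: if i >= 7 { 15 } else { 3 } }).collect();

    // Validate node: the freshly produced block is already executed and
    // stored, so it is part of the window being evaluated.
    let validate_decision = should_raise_gas_price(&history);

    // KeepUp node: the same block is stored but not yet executed, so its
    // view of processed blocks is one block shorter at this point.
    let keepup_decision = should_raise_gas_price(&history[..history.len() - 1]);

    println!("Validate raises gas price: {validate_decision}");
    println!("KeepUp   raises gas price: {keepup_decision}");

    // Under heavy load the two roles disagree; once the gas price diverges,
    // the KeepUp node's state no longer matches the rest of the network.
    assert_ne!(validate_decision, keepup_decision);
}
```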
The fix is straightforward: make the vacancy calculation aware of the slight state-based difference in flow so that KeepUp nodes use the same block inclusion that Validate nodes follow, thereby avoiding the off-by-one error.
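The sketch below illustrates the shape of such a fix on the same toy model (again, purely illustrative; the `Lifecycle` enum and all other names are hypothetical, not the actual casper-node change): the caller's position in the store/execute lifecycle becomes an explicit input, so Validate and KeepUp nodes evaluate exactly the same window of blocks.

```rust
// Toy sketch of the fix: make the vacancy calculation aware of whether the
// caller has already executed the newest block (Validate) or has only stored
// it so far (KeepUp), so both roles count the same set of blocks.

const WINDOW: usize = 5;                // hypothetical look-back window
const BLOCK_CAPACITY: u32 = 15;         // hypothetical per-block transaction limit
const FULL_BLOCKS_TO_RAISE: usize = 3;  // hypothetical threshold to bump the gas price

#[derive(Clone, Copy)]
struct Block {
    tx_count: u32,
}

impl Block {
    fn is_full(self) -> bool {
        self.tx_count >= BLOCK_CAPACITY
    }
}

/// The caller's position in the block lifecycle.
enum Lifecycle {
    /// Validate: the newest block is already executed and part of `executed`.
    Executed,
    /// KeepUp: the newest block is stored but not yet executed.
    StoredOnly(Block),
}

fn should_raise_gas_price(executed: &[Block], lifecycle: Lifecycle) -> bool {
    // Prepend the stored-but-unexecuted block (if any) so the evaluated
    // window is identical regardless of where in the lifecycle the node is.
    let pending = match lifecycle {
        Lifecycle::Executed => None,
        Lifecycle::StoredOnly(block) => Some(block),
    };
    let full = pending
        .into_iter()
        .chain(executed.iter().rev().copied())
        .take(WINDOW)
        .filter(|b| b.is_full())
        .count();
    full >= FULL_BLOCKS_TO_RAISE
}

fn main() {
    let history: Vec<Block> =
        (0..10).map(|i| Block { tx_count: if i >= 7 { 15 } else { 3 } }).collect();
    let newest = *history.last().unwrap();

    // Validate node: newest block already executed and stored.
    let validate = should_raise_gas_price(&history, Lifecycle::Executed);
    // KeepUp node: newest block stored but not yet executed.
    let keepup =
        should_raise_gas_price(&history[..history.len() - 1], Lifecycle::StoredOnly(newest));

    // With the lifecycle made explicit, both roles now agree.
    assert_eq!(validate, keepup);
    println!("gas price raised: {validate}");
}
```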