casper-node icon indicating copy to clipboard operation
casper-node copied to clipboard

SRE Testing: Investigate some aberrant behavior

Open cspramit opened this issue 1 year ago • 3 comments

Below is the test output , Engineering: please look under the section "some different behaviours observed" to investigate the aberrant behavior , if it has any substance.

SRE @Aj : Please look at the SSE event not reporting data further and update the comments on the ticket

Network: 100 + 20 Name: ajith-condor-big (8s) Branch: feat-2.0 Commit: ee9c6de38fb93076db68258f40a17ff8b0f382dc Config: minimum_block_time = '8192 ms'

block_gas_limit = 3_300_000_000_000

native_mint_lane = [0, 1024, 1024, 65_000_000_000, 325] native_auction_lane = [1, 2048, 2048, 362_500_000_000, 75] wasm_lanes = [[2, 1_048_576, 2048, 1_000_000_000_000, 1], [3, 344_064, 1024, 100_000_000_000, 10], [4, 172_032, 1024, 50_000_000_000, 15], [5, 12_288, 512, 1_500_000_000, 25]] Tests: Mixed Load of V1 and V2 Transfers, WASM Transfers, Auction Observations: time spent at switch block now under 5seconds allocated memory increases as new deploys in buffer allocated memory spikes by +1G at switch block at load when average memory was 1.1G, switchblock spiked to 2+ Gig block timestamps were stable and under 8s through the loading and execution period was able to hit 325 native transfers per block consistently also observed Auction Lane saturated at 75 periodically (as per load and config) also observed V1 wasm deploys allocated to Lane 3 at 10 (as per load and config)

some different behaviours observed at switch block - the SSE events seem to be not reporting data as per grafana seeing 2mins of missing data - (trying to investigate the SSE event outputs) the casper-node log messages show blocks and deploys executions at this period after all the pending deploys in the network were executed the block timestamps were periodically spiking to 10 to 20s 1 validate node stalled and later synced to tip in idle network at later switch block eras the finalisation times starting spiking to 18+ seconds some nodes were running in 'debug' mode so it could have an IO impact on the node itself Grafana: link Dumps: http://genesis.casperlabs.io/ajith-condor-big/casper-node-dumps/ajith-condor-big/27062024_0401/dump_download_list.html DB Snapshot: genesis.casperlabs.io/release-test-db-snapshots/ajith-condor-big/db.json genesis.casperlabs.io/release-test-db-snapshots/ajith-condor-big/db.tar.zst Nodes (debug) - 172.142.59.225, 172.142.71.238, 172.142.94.215

cspramit avatar Jun 27 '24 12:06 cspramit

Slack Discussions : https://casper-labs-team.slack.com/archives/C01ULKL8G8J/p1719463830035189 @thamps-casp cc Pleae investigate the sse events

cspramit avatar Jun 27 '24 12:06 cspramit

Did further investigations into the SSE and custom SSE exporter.

  1. It can be confirmed that SSE events are streaming correctly. No events were lost and it was confirmed by inspecting the events through custom scripts.
  2. There is a delay of SSE events streaming when the network reaches a switch block. The switch block event is streamed after delay of ~180 seconds and then the remainder of events starts flowing after. This delay causes the prometheus exporter export time and actual 'BlockAdded' event timestamp to be way out of sync.

ex. BlockAdded Event Message Received: 2024-07-02 05:38:55.754281 BlockAdded Event Timestamp in Message: 2024-07-02T05:38:55.645Z

2024-07-02 05:38:55,754 - __main__ - DEBUG - 2024-07-02 05:38:55.754281 UTC: [BlockAdded]
2024-07-02 05:38:55,754 - __main__ - DEBUG - {'block_hash': '38d7480abdb70448d97659404d7743626eabd42ef4a512af086c505d68ab91a0', 'block': {'Version2': {'hash': '38d7480abdb70448d97659404d7743626eabd42ef4a512af086c505d68ab91a0', 'header': {'parent_hash': '1956145b563b52426883268c99522ea18f661c79b3fb0cd12703a252e15a8e65', 'state_root_hash': 'e23b3eafdd8b8c66a3f00cc0208dd47810e4eab937da141db375dd09634afe59', 'body_hash': '15209a8832459e5db7bce8a55abe387d00b79282004a0580e1d1f0606e4b3d5a', 'random_bit': True, 'accumulated_seed': 'd526837f7cb26a87d7acd79122306b4c9db0bed6b451d2f89da5ff6c071e6a2e', 'era_end': None, 'timestamp': '2024-07-02T05:38:55.645Z', 'era_id': 10, 'height': 2195, 'protocol_version': '2.0.0', 'proposer': '0163e2867616bc81762e416d7ed9c1bc78a52ee7e7452cf2098a5332b6f476aea2', 'current_gas_price': 1, 'last_switch_block_hash': 'b5fae62e01f5894999b564d287d06d710c9065700c5f2581b4a334b608cfe8a4'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}

Switch Block BlockAdded Event Message Received: 2024-07-02 05:41:55.038500 BlockAdded Event Timestamp in Message: 2024-07-02T05:39:03.838Z (event is from past now as per prometheus)

2024-07-02 05:41:55,038 - __main__ - DEBUG - 2024-07-02 05:41:55.038500 UTC: [BlockAdded]
2024-07-02 05:41:55,039 - __main__ - DEBUG - {'block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc', 'block': {'Version2': {'hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc', 'header': {'parent_hash': '38d7480abdb70448d97659404d7743626eabd42ef4a512af086c505d68ab91a0', 'state_root_hash': '2db60b43f6998a28b5fb94a803c1e164905b0de9be2be360e2477c233d01f135', 'body_hash': '15209a8832459e5db7bce8a55abe387d00b79282004a0580e1d1f0606e4b3d5a', 'random_bit': True, 'accumulated_seed': 'ec2af6fe928c71c4f152dc4ed392d9220a37adaae700d561c20dddd32bb288dd', 'era_end': {'equivocators': [], 'inactive_validators': [], 'next_era_validator_weights': [...]},'rewards': {...},'next_era_gas_price': 1}, 'timestamp': '2024-07-02T05:39:03.838Z', 'era_id': 10, 'height': 2196, 'protocol_version': '2.0.0', 'proposer': '01a57f1bd15ef3370e2863591f19a55973a2665a90aa181bc14871878a235d4fae', 'current_gas_price': 1, 'last_switch_block_hash': 'b5fae62e01f5894999b564d287d06d710c9065700c5f2581b4a334b608cfe8a4'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}

and thereafter till the time get into sync

2024-07-02 05:41:55,044 - __main__ - DEBUG - 2024-07-02 05:41:55.044877 UTC: [BlockAdded]
2024-07-02 05:41:55,045 - __main__ - DEBUG - {'block_hash': 'a445181fd3b86846f1da834aea40eb9ac4b268c0b70eed3ae11c4c32e5ba1cd4', 'block': {'Version2': {'hash': 'a445181fd3b86846f1da834aea40eb9ac4b268c0b70eed3ae11c4c32e5ba1cd4', 'header': {'parent_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc', 'state_root_hash': '5dfb7e45790e15ec7d18809f8f4751151a29fa85a83e725f149e60b035ad5693', 'body_hash': '91d47c3500b87606b40fbc75eeb4840c214065a926af1d8a4e269384beed7c57', 'random_bit': True, 'accumulated_seed': '5c22c6234d801eef56b87c49b29ce929cec7e67b74712377417dfb9c77c98955', 'era_end': None, 'timestamp': '2024-07-02T05:39:16.299Z', 'era_id': 11, 'height': 2197, 'protocol_version': '2.0.0', 'proposer': '01d9fb3ae0f4e2798dbf82f14dee390adce67c003e470531cf17b7ca5fc31b9edd', 'current_gas_price': 1, 'last_switch_block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc'}, 'body': {'transactions': {}, 'rewarded_signatures': [[247, 251, 183, 250, 253, 255, 223, 255, 255, 239, 255, 223, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}
2024-07-02 05:41:55,052 - __main__ - DEBUG - 2024-07-02 05:41:55.052568 UTC: [BlockAdded]
2024-07-02 05:41:55,052 - __main__ - DEBUG - {'block_hash': '5b9fd8a2c98a9746a791cfd4e75ffafd01da4ae846ec312402611476a9fd6b8a', 'block': {'Version2': {'hash': '5b9fd8a2c98a9746a791cfd4e75ffafd01da4ae846ec312402611476a9fd6b8a', 'header': {'parent_hash': 'a445181fd3b86846f1da834aea40eb9ac4b268c0b70eed3ae11c4c32e5ba1cd4', 'state_root_hash': 'de861dd8c46fa0c867c1d073ee268a46f65cb7f7722d552337b4c5e62c46cf15', 'body_hash': '96d220874e8a5bd72c0597a140e6331d8188c1053140e9f1cb84d46e772420c9', 'random_bit': False, 'accumulated_seed': '710556ad739f0e27c19d55a9797d7a8619eed167522647e8f4e71487ebf1bc8c', 'era_end': None, 'timestamp': '2024-07-02T05:39:24.492Z', 'era_id': 11, 'height': 2198, 'protocol_version': '2.0.0', 'proposer': '018f0160018dc92571933192a52aff4989a9253c2b531d12ef45f8b1b2e7c1b528', 'current_gas_price': 1, 'last_switch_block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [8, 4, 72, 5, 2, 0, 32, 0, 0, 16, 0, 32, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}
...
2024-07-02 05:42:00,313 - __main__ - DEBUG - 2024-07-02 05:42:00.313370 UTC: [BlockAdded]
2024-07-02 05:42:00,313 - __main__ - DEBUG - {'block_hash': '0bcb2b0c01406044cb0ed1ab096ccd8d49c8c547e46b578e49b68cca5f711503', 'block': {'Version2': {'hash': '0bcb2b0c01406044cb0ed1ab096ccd8d49c8c547e46b578e49b68cca5f711503', 'header': {'parent_hash': '752d3d1a70931dafac09edb9f0ea71e5d89baf5a12c0141cab85e9e211483ead', 'state_root_hash': '0d4f938fdd541317db32e618381b728712afd6c9fb2eaeb972538f6a1252e1d6', 'body_hash': '15209a8832459e5db7bce8a55abe387d00b79282004a0580e1d1f0606e4b3d5a', 'random_bit': True, 'accumulated_seed': '3970633503208969d9a49ad2f4d910dffa9aa85f5e9a7b18296ff94dd01290c3', 'era_end': None, 'timestamp': '2024-07-02T05:42:00.162Z', 'era_id': 11, 'height': 2217, 'protocol_version': '2.0.0', 'proposer': '01a75c228a5bbf7fbfd396a58900aa56c9ab7b65055a4ee6d7fe52ad0531440127', 'current_gas_price': 1, 'last_switch_block_hash': 'fa1eff8511efb891bdb9c93d6fc9316c3b250f7b658fb60da52fea00678d93dc'}, 'body': {'transactions': {}, 'rewarded_signatures': [[255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}}}

thamps-casp avatar Jul 02 '24 05:07 thamps-casp

@alsrdn Alex S will look into

cspramit avatar Jul 18 '24 12:07 cspramit

Re-running the same investigation with latest 2.0 commit 9109c7b3910de54673c642c964a3a8036933332c

Network: 5 +2 Issue: Not observed. The SSE event monitoring does not show any delay of the switch block BlockAdded event. This maybe because of the small network. Next to test with a large 100 + 20 network.

thamps-casp avatar Oct 14 '24 07:10 thamps-casp

Ran test with a 60 + 10 network. Note: 100 + 20 cannot be provisioned due to aws limits enforcement.

With this medium sized network - this issue is not seen.

Switch Block BlockAdded Event Message Received: 2024-10-16T03:28:33.157Z BlockAdded Event Timestamp in Message: 2024-10-16T03:28:15.272Z

thamps-casp avatar Oct 16 '24 03:10 thamps-casp

100 Validators + 1 - large network Block Time: 8secs

  • added continous load to the network with wasm and transfer deploys
  • issue still not observed in the scale of issue raised. 30-40s delay only
  • SSE switch block event lag is under 30s with load.

Switch Block BlockAdded Event Message Received: 2024-10-16T07:46:43.679Z BlockAdded Event Timestamp in Message: 2024-10-16T07:46:14.898Z

thamps-casp avatar Oct 16 '24 08:10 thamps-casp