[node-metrics] Ability to collect system metrics from Aptos nodes

Open ibalajiarun opened this issue 3 years ago • 4 comments

Description

This PR introduces the ability to collect platform-agnostic system metrics from Aptos nodes and expose them as Prometheus metrics. The metrics are also exported to Aptos via the telemetry service. Multiple collectors gather different categories of system metrics (CPU, memory, disk, etc.) using the sysinfo crate.
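As a rough illustration of the collector idea described above, here is a minimal std-only sketch. It is not the code in this PR: `Sample`, `Collector`, `StaticLoadAvg`, `StaticCpuInfo`, and `render` are hypothetical names, the collectors return fixed values where a real one would query the OS (for example through sysinfo), and the output is hand-formatted rather than produced by a Prometheus client library. Label values are not escaped.

```rust
use std::fmt::Write;

/// One metric sample (hypothetical type, for illustration only).
pub struct Sample {
    pub name: &'static str,
    pub help: &'static str,
    pub kind: &'static str, // "gauge" or "counter"
    pub labels: Vec<(&'static str, String)>,
    pub value: f64,
}

/// A collector gathers one category of system metrics.
pub trait Collector {
    fn collect(&self) -> Vec<Sample>;
}

/// Stand-in for a load-average collector; returns a fixed value
/// instead of querying the operating system.
pub struct StaticLoadAvg {
    pub one: f64,
}

impl Collector for StaticLoadAvg {
    fn collect(&self) -> Vec<Sample> {
        vec![Sample {
            name: "node_loadavg_load1",
            help: "1m load average.",
            kind: "gauge",
            labels: vec![],
            value: self.one,
        }]
    }
}

/// Stand-in for a CPU-info collector; the value is always 1 and the
/// information is carried in the labels.
pub struct StaticCpuInfo {
    pub brand: String,
    pub vendor: String,
}

impl Collector for StaticCpuInfo {
    fn collect(&self) -> Vec<Sample> {
        vec![Sample {
            name: "node_system_cpu_info",
            help: "CPU information.",
            kind: "gauge",
            labels: vec![("brand", self.brand.clone()), ("vendor", self.vendor.clone())],
            value: 1.0,
        }]
    }
}

/// Runs every collector and renders the samples as HELP/TYPE/value lines.
pub fn render(collectors: &[Box<dyn Collector>]) -> String {
    let mut out = String::new();
    for collector in collectors {
        for s in collector.collect() {
            writeln!(out, "# HELP {} {}", s.name, s.help).unwrap();
            writeln!(out, "# TYPE {} {}", s.name, s.kind).unwrap();
            if s.labels.is_empty() {
                writeln!(out, "{} {}", s.name, s.value).unwrap();
            } else {
                let labels: Vec<String> = s
                    .labels
                    .iter()
                    .map(|(k, v)| format!("{}=\"{}\"", k, v))
                    .collect();
                writeln!(out, "{}{{{}}} {}", s.name, labels.join(","), s.value).unwrap();
            }
        }
    }
    out
}
```

Keeping each category behind its own `Collector` implementation means a platform-specific collector could be added later without touching the rendering path.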

Test Plan

Added unit tests

Below is the output from local testing:

# HELP node_disk_available_space Total available disk size in bytes
# TYPE node_disk_available_space counter
node_disk_available_space{file_system="apfs",name="Data",type="SSD"} 116468510720
# HELP node_disk_total_space Total disk size in bytes
# TYPE node_disk_total_space counter
node_disk_total_space{file_system="apfs",name="Data",type="SSD"} 994662584320
# HELP node_loadavg_load1 1m load average.
# TYPE node_loadavg_load1 gauge
node_loadavg_load1 25.810546875
# HELP node_loadavg_load15 15m load average.
# TYPE node_loadavg_load15 gauge
node_loadavg_load15 6.31591796875
# HELP node_loadavg_load5 5m load average.
# TYPE node_loadavg_load5 gauge
node_loadavg_load5 10.93896484375
# HELP node_network_total_packets_received Total number of incoming packets
# TYPE node_network_total_packets_received counter
node_network_total_packets_received{interface_name="en0"} 4476845
# HELP node_network_total_packets_transmitted Total number of outgoing packets
# TYPE node_network_total_packets_transmitted counter
node_network_total_packets_transmitted{interface_name="en0"} 1953492
# HELP node_network_total_received Total number of received bytes
# TYPE node_network_total_received counter
node_network_total_received{interface_name="en0"} 4156922880
# HELP node_network_total_transmitted Total number of transmitted bytes
# TYPE node_network_total_transmitted counter
node_network_total_transmitted{interface_name="en0"} 930886656
# HELP node_process_cpu_usage CPU usage.
# TYPE node_process_cpu_usage gauge
node_process_cpu_usage 781.2310180664063
# HELP node_process_disk_total_read_bytes Total bytes read.
# TYPE node_process_disk_total_read_bytes gauge
node_process_disk_total_read_bytes 6434816
# HELP node_process_disk_total_written_bytes Total bytes written.
# TYPE node_process_disk_total_written_bytes gauge
node_process_disk_total_written_bytes 23715840
# HELP node_process_memory Memory usage in bytes.
# TYPE node_process_memory gauge
node_process_memory 231948
# HELP node_process_run_time Run time of the process in seconds.
# TYPE node_process_run_time gauge
node_process_run_time 30
# HELP node_process_start_time Starts time of the process in seconds since epoch.
# TYPE node_process_start_time gauge
node_process_start_time 1663016819
# HELP node_process_virtual_memory Virtual memory usage in bytes.
# TYPE node_process_virtual_memory gauge
node_process_virtual_memory 420680204
# HELP node_system_cpu_info CPU information.
# TYPE node_system_cpu_info gauge
node_system_cpu_info{brand="Apple M1 Max",vendor="Apple"} 1
# HELP node_system_cpu_usage CPU usage.
# TYPE node_system_cpu_usage gauge
node_system_cpu_usage{cpu_id="0"} 100
node_system_cpu_usage{cpu_id="cpu10_idx10"} 100
node_system_cpu_usage{cpu_id="cpu1_idx1"} 100
node_system_cpu_usage{cpu_id="cpu2_idx2"} 100
node_system_cpu_usage{cpu_id="cpu3_idx3"} 100
node_system_cpu_usage{cpu_id="cpu4_idx4"} 100
node_system_cpu_usage{cpu_id="cpu5_idx5"} 100
node_system_cpu_usage{cpu_id="cpu6_idx6"} 100
node_system_cpu_usage{cpu_id="cpu7_idx7"} 100
node_system_cpu_usage{cpu_id="cpu8_idx8"} 100
node_system_cpu_usage{cpu_id="cpu9_idx9"} 100
# HELP node_system_mem_free Memory free.
# TYPE node_system_mem_free gauge
node_system_mem_free 1588800
# HELP node_system_mem_total Memory total.
# TYPE node_system_mem_total counter
node_system_mem_total 34359738
# HELP node_system_mem_used Memory used.
# TYPE node_system_mem_used gauge
node_system_mem_used 32770938
# HELP node_system_swap_free Swap memory free.
# TYPE node_system_swap_free gauge
node_system_swap_free 0
# HELP node_system_swap_total Swap memory total.
# TYPE node_system_swap_total counter
node_system_swap_total 0
# HELP node_system_swap_used Swap memory used.
# TYPE node_system_swap_used gauge
node_system_swap_used 0
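
For anyone who wants to sanity-check output like the above, here is a small std-only sketch that parses the sample lines and checks that `node_system_mem_used` plus `node_system_mem_free` equals `node_system_mem_total`. The function names `parse_samples` and `memory_is_consistent` are hypothetical, and the parser is deliberately simplistic: it skips `#` lines and splits each sample at the last space, so it does not handle timestamps or escaped label values. The equality holds for the numbers in this local run (32770938 + 1588800 = 34359738); it is not assumed to hold on every platform.

```rust
use std::collections::HashMap;

/// Parses sample lines into a map from series (name plus labels) to value.
/// Comment lines starting with '#' and unparsable lines are skipped.
pub fn parse_samples(text: &str) -> HashMap<String, f64> {
    let mut samples = HashMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The value is everything after the last space on the line.
        if let Some((series, value)) = line.rsplit_once(' ') {
            if let Ok(v) = value.parse::<f64>() {
                samples.insert(series.trim().to_string(), v);
            }
        }
    }
    samples
}

/// Returns None if any of the three memory series is missing,
/// otherwise whether used + free equals total.
pub fn memory_is_consistent(samples: &HashMap<String, f64>) -> Option<bool> {
    let total = samples.get("node_system_mem_total")?;
    let used = samples.get("node_system_mem_used")?;
    let free = samples.get("node_system_mem_free")?;
    Some(used + free == *total)
}
```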


ibalajiarun avatar Sep 12 '22 21:09 ibalajiarun

Can we have a breakdown of CPU usage? Can we have both RSS and working_set for memory usage? Can we have IOPS?

@grao1991 Yes, that will be in a separate PR. It seems we can currently collect those only on Linux. This PR is for platform-agnostic metrics.

ibalajiarun avatar Sep 12 '22 22:09 ibalajiarun

@ibalajiarun - As a follow-up to this, can you make sure to add/update the Grafana dashboards to monitor these newly added metrics?

sitalkedia avatar Sep 13 '22 20:09 sitalkedia

Thanks a lot, @ibalajiarun, for collecting these very useful system metrics. Do you know what the overhead of collecting these metrics is, and how frequently we are collecting them?

@sitalkedia The overhead is about 100 microseconds per collect call, as measured on our recommended hardware spec, so I set the default collection interval to 15 seconds. I've also added a latency histogram to monitor the collect calls, which I will add to the dashboard. If the overhead turns out to affect us, we can tune the collection frequency.
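
To put those numbers in perspective: 100 microseconds of work every 15 seconds is 1e-4 / 15, roughly 0.0007% of wall-clock time. The sketch below shows one way to time a collect call and bucket the result, plus that overhead calculation. It is std-only and illustrative: `LatencyHistogram`, `timed_collect`, and `overhead_fraction` are hypothetical names, not the ones used in this PR, and the buckets here hold per-bucket counts rather than the cumulative counts a Prometheus histogram exposes.

```rust
use std::time::{Duration, Instant};

/// Minimal latency histogram with fixed upper bounds.
pub struct LatencyHistogram {
    bounds: Vec<Duration>, // upper bounds, ascending
    counts: Vec<u64>,      // one count per bound, plus a final overflow bucket
}

impl LatencyHistogram {
    pub fn new(bounds: Vec<Duration>) -> Self {
        let n = bounds.len();
        Self { bounds, counts: vec![0; n + 1] }
    }

    /// Records a duration in the first bucket whose bound is >= d,
    /// or in the overflow bucket if it exceeds every bound.
    pub fn observe(&mut self, d: Duration) {
        let idx = self
            .bounds
            .iter()
            .position(|b| d <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    pub fn counts(&self) -> &[u64] {
        &self.counts
    }
}

/// Runs a collect closure, records how long it took, and returns its result.
pub fn timed_collect<T>(hist: &mut LatencyHistogram, collect: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = collect();
    hist.observe(start.elapsed());
    out
}

/// Fraction of wall-clock time spent collecting, given the cost of one
/// call and the interval between calls.
pub fn overhead_fraction(per_call: Duration, interval: Duration) -> f64 {
    per_call.as_secs_f64() / interval.as_secs_f64()
}
```

Watching the overflow bucket is a cheap way to notice when a collect call starts taking longer than expected, which would be the signal to increase the interval.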

ibalajiarun avatar Sep 13 '22 21:09 ibalajiarun

Forge is running suite land_blocking on 0b656bc3b9034725ecdfb9dd0784f99a1514daef

Forge is running suite compat on testnet ==> 0b656bc3b9034725ecdfb9dd0784f99a1514daef

:white_check_mark: Forge suite compat success on testnet ==> 0b656bc3b9034725ecdfb9dd0784f99a1514daef

Compatibility test results for testnet ==> 0b656bc3b9034725ecdfb9dd0784f99a1514daef (PR)
1. Check liveness of validators at old version: testnet
compatibility::simple-validator-upgrade::liveness-check : 6806 TPS, 3997 ms latency, 6400 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 0b656bc3b9034725ecdfb9dd0784f99a1514daef
compatibility::simple-validator-upgrade::single-validator-upgrade : 5434 TPS, 4862 ms latency, 6800 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 0b656bc3b9034725ecdfb9dd0784f99a1514daef
compatibility::simple-validator-upgrade::half-validator-upgrade : 4869 TPS, 5495 ms latency, 8300 ms p99 latency,no expired txns
4. upgrading second batch to new version: 0b656bc3b9034725ecdfb9dd0784f99a1514daef
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6588 TPS, 4123 ms latency, 6700 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet ==> 0b656bc3b9034725ecdfb9dd0784f99a1514daef passed
Test Ok

:white_check_mark: Forge suite land_blocking success on 0b656bc3b9034725ecdfb9dd0784f99a1514daef

performance benchmark with full nodes : 7319 TPS, 4054 ms latency, 6400 ms p99 latency,no expired txns
Test Ok

Forge is running suite land_blocking on 03470a3ecf5fb6a926dd784f0be08208508c0604

:white_check_mark: Forge suite land_blocking success on 03470a3ecf5fb6a926dd784f0be08208508c0604

performance benchmark with full nodes : 7430 TPS, 4006 ms latency, 6600 ms p99 latency,no expired txns
Test Ok

Forge is running suite land_blocking on 4c7b65ab82b47a50cb650ce0e8fca77880871f46

Forge is running suite compat on testnet ==> 4c7b65ab82b47a50cb650ce0e8fca77880871f46

:white_check_mark: Forge suite compat success on testnet ==> 4c7b65ab82b47a50cb650ce0e8fca77880871f46

Compatibility test results for testnet ==> 4c7b65ab82b47a50cb650ce0e8fca77880871f46 (PR)
1. Check liveness of validators at old version: testnet
compatibility::simple-validator-upgrade::liveness-check : 7098 TPS, 3809 ms latency, 5700 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 4c7b65ab82b47a50cb650ce0e8fca77880871f46
compatibility::simple-validator-upgrade::single-validator-upgrade : 5018 TPS, 5686 ms latency, 9300 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 4c7b65ab82b47a50cb650ce0e8fca77880871f46
compatibility::simple-validator-upgrade::half-validator-upgrade : 5709 TPS, 4677 ms latency, 6400 ms p99 latency,no expired txns
4. upgrading second batch to new version: 4c7b65ab82b47a50cb650ce0e8fca77880871f46
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6639 TPS, 3788 ms latency, 6000 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet ==> 4c7b65ab82b47a50cb650ce0e8fca77880871f46 passed
Test Ok

:white_check_mark: Forge suite land_blocking success on 4c7b65ab82b47a50cb650ce0e8fca77880871f46

performance benchmark with full nodes : 7455 TPS, 3979 ms latency, 6600 ms p99 latency,no expired txns
Test Ok

github-actions[bot] avatar Sep 13 '22 23:09 github-actions[bot]