Alert BOSHJobHighCPULoad does not take number of CPUs into account
The BOSHJobHighCPULoad alert queries the bosh_job_load_avg01 metric. The problem is that a sensible warning threshold for this metric depends on the number of CPUs. Generally speaking, 100% CPU load on 1 core is indicated by a load average of 1, while 100% CPU load on a 16-core machine results in a load average of 16.
Since the query for this alert does not divide the metric by the number of CPUs, it is completely useless: on a 1-core machine, the default threshold of 5 means that you have to fix it immediately. On a 4-core machine, a load average of 5 still indicates a slight problem. But on an 8-core machine, a load average of 5 is absolutely fine.
So my request: can the load avg please be divided by the number of cpus in the machine before comparing it to the threshold value?
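To illustrate the arithmetic behind this request, here is a minimal Python sketch of the normalization. It is not the alert's actual query; the function names `load_per_core` and `is_high_load` and the per-core threshold of 1.0 are hypothetical and only serve the example.

```python
def load_per_core(load_avg: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_high_load(load_avg: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """Compare the per-core load against a threshold.

    The default of 1.0 (all cores fully busy) is an assumed value for
    illustration, not the alert's real default.
    """
    return load_per_core(load_avg, cpu_count) > threshold


if __name__ == "__main__":
    # The three machines from the description, each with a raw load average of 5.
    for cores in (1, 4, 8):
        print(cores, load_per_core(5.0, cores), is_high_load(5.0, cores))
```

With the same raw load average of 5, the per-core value is 5.0 on 1 core, 1.25 on 4 cores, and 0.625 on 8 cores, so only the first two would exceed the assumed threshold.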
Agreed. The problem is that BOSH does NOT provide the number of CPUs, so we cannot calculate that metric/alert properly.
One workaround would be to create System alerts (see pending issue https://github.com/cloudfoundry-community/prometheus-boshrelease/issues/38). The metrics would come from the node_exporter, and those metrics contain both the load averages and the number of CPUs. I still need to work on those alerts; once they are done, I'd suggest switching to them.
I just ran into this great article on the meaning of load averages. I thought I'd leave it here for anyone trying to draw conclusions from this metric: http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html. Also, monitoring the "better metrics" recommended by the author could be a great addition to the dashboards/alerts.
Wouldn't it be better if the alert was based on the bosh_job_cpu_user metric?
This issue is stale because it has been open 60 days with no activity. Comment or this will be closed in 5 days.
This issue was automatically closed because it has been stalled for 5 days with no activity.