Alert BOSHJobHighCPULoad does not take number of CPUs into account

Open vChrisR opened this issue 8 years ago • 3 comments

The BOSHJobHighCPULoad alert queries the bosh_job_load_avg01 metric. The problem is that a sensible warning threshold for this metric depends on the number of CPUs. Generally speaking, 100% CPU load on 1 core is indicated by a load average of 1, while 100% CPU load on a 16-core machine would result in a 16.

Since the query for this alert does not divide the metric by the number of CPUs, it is completely useless: on a 1-core machine, the default threshold of 5 means you have to fix things immediately. On a 4-core machine, a load average of 5 still indicates a slight problem. But on an 8-core machine, a load average of 5 is absolutely fine.

So my request: can the load avg please be divided by the number of cpus in the machine before comparing it to the threshold value?
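
To illustrate the arithmetic behind this request, here is a minimal Python sketch. The function names and the per-core threshold of 1.0 are assumptions made up for this example; they are not part of the release or of the alert's actual query.

```python
def load_per_core(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_overloaded(load_avg: float, cpu_count: int,
                  per_core_threshold: float = 1.0) -> bool:
    """Compare the per-core load against a threshold.

    The default of 1.0 (one runnable task per core) is a hypothetical
    value chosen for illustration only.
    """
    return load_per_core(load_avg, cpu_count) > per_core_threshold


if __name__ == "__main__":
    # The same raw load average of 5 on machines of different sizes.
    for cores in (1, 4, 8):
        print(cores, load_per_core(5.0, cores), is_overloaded(5.0, cores))
```

With a raw load average of 5, the per-core value is 5.0 on 1 core, 1.25 on 4 cores, and 0.625 on 8 cores, so only the 8-core machine falls below the assumed per-core threshold.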

vChrisR avatar Aug 03 '17 11:08 vChrisR

Agreed. The problem is that BOSH does NOT provide the number of CPUs, so we cannot calculate that metric/alert properly.

One workaround would be to create System alerts (see pending issue https://github.com/cloudfoundry-community/prometheus-boshrelease/issues/38). The metrics would come from the node_exporter, and they contain both the load averages and the number of CPUs. I still need to work on those alerts; once they are done, I'd suggest switching to them.

frodenas avatar Aug 08 '17 21:08 frodenas

I just ran into this great article on the meaning of load averages. I thought I'd leave it here for anyone trying to draw conclusions from this metric: http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html. Also, monitoring the "better metrics" recommended by the author could be a great addition to the dashboards/alerts.

mkuratczyk avatar Aug 23 '17 13:08 mkuratczyk

Wouldn't it be better if the alert was based on the bosh_job_cpu_user metric?

prolane avatar Dec 03 '18 13:12 prolane

This issue is stale because it has been open 60 days with no activity. Comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 25 '23 13:04 github-actions[bot]

This issue was automatically closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar May 01 '23 02:05 github-actions[bot]