Flaws in some definitions in kubernetes-prometheusRule.yaml

Open. allenhsu opened this issue 4 years ago • 1 comment

What happened?

Recently we started receiving KubeAPIErrorBudgetBurn alerts. Looking at the historical data, I noticed that the metrics, e.g. apiserver_request:burnrate1d, were empty before the alert. After digging into the definitions, I found flaws in some of the rules in kubernetes-prometheusRule.yaml. The underlying metrics did not actually change much before and after the alert; these metrics only became available once the apiserver started returning 5xx errors.

The expression sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d])) returns no data until a 5xx error occurs, so the metrics that depend on it are unavailable until then.
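To illustrate why an empty operand empties the whole result: a PromQL binary operator between two instant vectors only emits a value for label sets present on both sides. Below is a minimal Python sketch of that behaviour, assuming the rule adds the 5xx rate to other per-cluster terms. It is a simplified model, not Prometheus code; `add_vectors`, `slow`, and the numbers are hypothetical.

```python
def add_vectors(lhs, rhs):
    """Model PromQL `lhs + rhs` with one-to-one matching on identical label sets."""
    return {labels: lhs[labels] + rhs[labels] for labels in lhs if labels in rhs}


# Hypothetical per-cluster rate of "too slow" requests.
slow = {("cluster", "prod"): 0.02}

# Before any 5xx response, the code=~"5.." selector matches no series at all.
errors_none = {}
# After the first 5xx responses, a series exists.
errors_some = {("cluster", "prod"): 0.01}

no_errors_result = add_vectors(slow, errors_none)
with_errors_result = add_vectors(slow, errors_some)

print(no_errors_result)    # empty: nothing to match against
print(with_errors_result)  # one element for the "prod" cluster
```

In this model the sum has no elements at all while the error vector is empty, which matches the empty burn-rate series observed before the alert.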

Did you expect to see something different?

I expect the metrics to produce correct numbers even before any 5xx error happens. The wrong definitions should be fixed.

How to reproduce it (as minimally and precisely as possible):

https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetes-prometheusRule.yaml#L785

sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d]))

should be replaced by

sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d]) or vector(0))
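A rough sketch of what the proposed change does, again as a simplified Python model rather than Prometheus itself. It assumes label sets can be modelled as frozensets of (name, value) pairs; `promql_or`, `sum_by_cluster`, and the sample values are hypothetical names for illustration.

```python
from collections import defaultdict


def promql_or(lhs, rhs):
    """Model `lhs or rhs`: keep lhs, add rhs elements whose label set is absent from lhs."""
    result = dict(lhs)
    for labels, value in rhs.items():
        if labels not in lhs:
            result[labels] = value
    return result


def sum_by_cluster(vector):
    """Model `sum by (cluster)`: group on the cluster label ("" when it is missing)."""
    totals = defaultdict(float)
    for labels, value in vector.items():
        totals[dict(labels).get("cluster", "")] += value
    return dict(totals)


VECTOR_ZERO = {frozenset(): 0.0}  # vector(0) is a single element with no labels

no_errors = {}  # selector matched nothing
some_errors = {frozenset({("cluster", "prod"), ("code", "500")}): 0.01}

patched_empty = sum_by_cluster(promql_or(no_errors, VECTOR_ZERO))
patched_busy = sum_by_cluster(promql_or(some_errors, VECTOR_ZERO))

print(patched_empty)
print(patched_busy)
```

With the fallback, the expression is never empty. Note that in this model the fallback element carries no cluster label, so whether it matches the other per-cluster terms of a rule depends on how those terms are labelled; that is worth verifying against a real Prometheus instance.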

The same applies to other similar expressions in the file.

Environment

Not environment-specific.

Anything else we need to know?:

Nope

Nov 03 '21 08:11 allenhsu

The alert definition comes from the https://github.com/kubernetes-monitoring/kubernetes-mixin/ project. Please file an issue there.

Nov 03 '21 10:11 paulfantom