Flaws in some definitions in kubernetes-prometheusRule.yaml
What happened?
Recently we started receiving KubeAPIErrorBudgetBurn alerts. Looking at the historical data, I noticed that the recorded metrics, e.g. apiserver_request:burnrate1d, were empty before the alert fired. After digging into the rules, I found flaws in some of the definitions in kubernetes-prometheusRule.yaml. The underlying metrics did not actually change much before and after the alert; the recorded metrics only became available once the apiserver started returning 5xx errors.
The subexpression sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d])) returns no data until a 5xx error occurs. Arithmetic with an empty operand yields an empty result, so the metric built on it stays empty as well.
Did you expect to see something different?
I expected the metrics to produce correct values even before any 5xx error occurs. The incorrect definitions should be fixed.
How to reproduce it (as minimally and precisely as possible):
https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/kubernetes-prometheusRule.yaml#L785
sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d]))
should be replaced by
sum by (cluster) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET",code=~"5.."}[1d]) or vector(0))
The same applies to other similar expressions in that file.
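To illustrate why the missing 5xx series empties the whole result, and what the proposed `or vector(0)` changes, here is a minimal Python sketch. It is a toy model of PromQL instant vectors, not Prometheus code: the helper names (`sum_by`, `or_`, `add`) and the sample value 0.02 are made up for illustration, and series are assumed to carry no cluster label.

```python
# Toy model of PromQL instant vectors: {frozenset of (label, value) pairs: sample}.
# Illustration only; this is not how Prometheus is implemented.

def labels(**kv):
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())

def sum_by(vector, *by):
    """Like `sum by (...)`: keep only the `by` labels and add up the samples."""
    out = {}
    for lset, value in vector.items():
        key = frozenset((k, v) for k, v in lset if k in by)
        out[key] = out.get(key, 0.0) + value
    return out

def or_(left, right):
    """Like `or`: all of left, plus right elements whose label set is not in left."""
    out = dict(left)
    for lset, value in right.items():
        out.setdefault(lset, value)
    return out

def add(left, right):
    """Like `+` with one-to-one matching: only label sets on both sides survive."""
    return {lset: left[lset] + right[lset] for lset in left if lset in right}

VECTOR_0 = {frozenset(): 0.0}  # vector(0): a single sample with no labels

# Hypothetical slow-request rate, already aggregated by cluster.
slow = sum_by({labels(verb="GET", code="200"): 0.02}, "cluster")
# The 5xx selector matches nothing before the first 5xx error.
no_5xx = {}

without_fix = add(slow, sum_by(no_5xx, "cluster"))
with_fix = add(slow, sum_by(or_(no_5xx, VECTOR_0), "cluster"))

print(without_fix)  # {}
print(with_fix)     # {frozenset(): 0.02}
```

One caveat the model also shows: the zero sample has no labels, so if the real series carried a cluster label it would land in a separate, empty-cluster group and would not match the other operand. Whether that matters depends on the setup.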
Environment
Not environment-specific.
Anything else we need to know?:
Nope
The alert definition comes from the https://github.com/kubernetes-monitoring/kubernetes-mixin/ project. Please file an issue there.