configmapsecrets

Record metrics for render failures

Open Niksko opened this issue 4 years ago • 3 comments

At the moment, it's hard to monitor this service, because render failures aren't recorded. It would be nice if the existing metrics endpoint exposed these metrics.

Happy to submit a PR

Niksko avatar Apr 27 '21 23:04 Niksko

Never mind, this is already recorded as part of the standard controller runtime reconcile metrics. Just make sure you're looking at the instance that is the leader, otherwise they won't show up 😅

Niksko avatar Apr 28 '21 06:04 Niksko

Actually, the currently exported metrics don't really fit the purpose of alerting on render failures. What we need is some sort of condition gauge like Flux has: https://github.com/fluxcd/pkg/blob/main/runtime/metrics/recorder.go

Again, happy to submit a PR for this

Niksko avatar Apr 28 '21 07:04 Niksko

For a little bit of context, the intended design was to allow the separation of (human) operators and users. For instance, one team may run one CMS controller for an entire cluster while multiple other teams use CMS objects in the cluster.

As you found, there is a built-in controller_runtime_reconcile_total metric that includes a result label (success, error, requeue, or requeue_after). It was a conscious decision not to treat cases where a required secret/configmap object/key is missing as a reconcile error; instead, an info-level warning is logged and the CMS is requeued. An error would indicate that something is actually wrong (e.g. RBAC) and may need (human) operator intervention. A requeue would indicate that a user hasn't configured their CMS properly (and operators should sleep through the night without getting paged).

I went ahead and added an explicit configmapsecret_controller_missing_value_render_errors_total metric, which includes a label for the namespace of the CMS. A gauge, as you suggested, would probably be nicer, since the reconcile retries will eventually back off to ~15m. Another possible solution would be to add kube-state-metrics-like support for all CMS instances with their current status.

abursavich avatar Apr 28 '21 18:04 abursavich