
Raise a metric whenever CAPI cannot see a remote cluster client

Open perithompson opened this issue 4 years ago • 11 comments

User Story

As an operator, I would like CAPI to raise a metric whenever it cannot reach a remote cluster's client, so that we can see a continuous rise in errors contacting the workload cluster and potentially raise an alert.

Detailed Description

Similar to https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/1281, it would be useful to be able to spot when clusters are not reachable from the management cluster, so that we can monitor and alert when reconciliation should be paused until the remote cluster is reachable again.

Anything else you would like to add: This also relates to https://github.com/kubernetes-sigs/cluster-api/issues/5394, which asks for more information on the annotations used when cluster communication is interrupted.

/kind feature

perithompson avatar Oct 27 '21 12:10 perithompson

/area health

killianmuldoon avatar Oct 27 '21 14:10 killianmuldoon

Praise for the clear definition of the required metric. /milestone v1.2

fabriziopandini avatar Jan 26 '22 15:01 fabriziopandini

/assign

killianmuldoon avatar Jan 28 '22 12:01 killianmuldoon

All credit to @chrischdi for the following info 🙂

We have a metric, exposed by client-go through controller-runtime, which reports error responses in the client. It's called rest_client_requests_total and it has the following format:

rest_client_requests_total{code="<error>",host="172.18.0.4:6443",method="GET"} 36

The host IP here is the ControlPlaneEndpoint IP, i.e. the Cluster's .spec.controlPlaneEndpoint.host.

It is currently exposed on CAPI's metrics endpoint (by default <CAPI-IP>:8080/metrics). You can see it yourself (with your kubecontext set to the management cluster) using:

kubectl port-forward -n capi-system deployments/capi-controller-manager 8080:8080 &
curl localhost:8080/metrics | grep -i rest_client_requests_total

For now we don't have an automated way to link the host IP to the Cluster in Prometheus. Once #6404 is added to the repo, we can add a metric that links these two pieces of information together, giving remote client errors by cluster name / namespace.

So this metric can be used to understand when the remote cluster is unreachable. Does this suit your use case @perithompson?
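In the meantime, a PromQL sketch of how one might watch for this per host (the code value <error> is what client-go reports when a request got no HTTP response at all, e.g. a connection refusal or timeout; the window here is illustrative):

```promql
# Per-host rate of failed requests from the CAPI controllers' rest clients.
# code="<error>" means the request never received an HTTP response.
sum(rate(rest_client_requests_total{code="<error>"}[5m])) by (host)
```

A sustained non-zero value for a given host suggests that workload cluster's apiserver is unreachable from the management cluster.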

killianmuldoon avatar Apr 14 '22 14:04 killianmuldoon

@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric and similar ones with an additional label for the cluster. I think this would be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric that has the IP).

sbueringer avatar Apr 26 '22 14:04 sbueringer

@killianmuldoon @chrischdi I wonder if it would be possible to extend this metric and similar ones with an additional label for the cluster. I think this would be a very nice improvement, as it makes the metrics easier to use and avoids joins in PromQL (to use those metrics you basically always have to join with another metric that has the IP).

In theory this would be possible by:

  • passing the cluster value via context to the metrics adapter, and reading it there from the context.

However, we could only implement this in one of three ways, each of which I think is too much effort or has too many cons:

  • modify controller-runtime to support / allow this change in some way
  • write our own metric, but:
    • the metric would need a different name (not rest_client_requests_total, because controller-runtime registers that metric in a func init(){...} call)
    • by doing that, rest_client_requests_total would no longer be populated by client-go and thus would cease to exist
  • rewrite the whole controller-runtime metrics package and use that for the metrics HTTP endpoint

chrischdi avatar Apr 26 '22 16:04 chrischdi

Maybe 1. or 3. is an option for the future. The upside of investing the effort in controller-runtime is that a lot of folks (including other providers) can benefit from it.

sbueringer avatar Apr 26 '22 16:04 sbueringer

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 25 '22 16:07 k8s-triage-robot

/remove-lifecycle stale

Work is still ongoing, tracked in #6458, to improve the UX for this in CAPI.

killianmuldoon avatar Jul 25 '22 16:07 killianmuldoon

/triage accepted
/unassign @killianmuldoon
/help

fabriziopandini avatar Oct 03 '22 19:10 fabriziopandini

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/triage accepted /unassign @killianmuldoon /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 03 '22 19:10 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 08 '23 06:02 k8s-triage-robot

/lifecycle frozen

fabriziopandini avatar Feb 09 '23 09:02 fabriziopandini

This should meanwhile be possible via PromQL queries together with the custom resource metrics configuration.

Example:

(sum(rate(rest_client_requests_total{code="<error>"}[1m])) by (host,code,provider))
* on(host) group_left(name)
label_join(capi_cluster_info, "host", ":", "control_plane_endpoint_host", "control_plane_endpoint_port")


(The resulting graph shows the error response rate per provider/controller during creation of this cluster.)

Based on this information it should be possible to create alerts :-)
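For example, a sketch of a Prometheus alerting rule built on the query above. The threshold, duration, and alert/label names are illustrative, and capi_cluster_info is assumed to come from the custom resource metrics configuration:

```yaml
groups:
  - name: capi-remote-cluster
    rules:
      - alert: WorkloadClusterUnreachable
        # Fire when the management cluster's clients keep failing to reach
        # a workload cluster's apiserver for 10 minutes.
        expr: |
          (sum(rate(rest_client_requests_total{code="<error>"}[5m])) by (host))
          * on(host) group_left(name)
          label_join(capi_cluster_info, "host", ":", "control_plane_endpoint_host", "control_plane_endpoint_port")
          > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CAPI cannot reach workload cluster {{ $labels.name }}"
```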

chrischdi avatar Sep 01 '23 11:09 chrischdi

Very nice!!

sbueringer avatar Sep 01 '23 11:09 sbueringer

/priority important-longterm

fabriziopandini avatar Apr 12 '24 14:04 fabriziopandini

/close

We already have that now, see Christian's example above

sbueringer avatar Apr 15 '24 06:04 sbueringer

@sbueringer: Closing this issue.

In response to this:

/close

We already have that now, see Christian's example above

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 15 '24 06:04 k8s-ci-robot