Scale Set Metrics - Missing Repo Name and Workflow Name

Open jameshounshell opened this issue 2 years ago • 1 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Install both the ARC and the Scale Set Helm charts on a Kubernetes cluster.
2. Run a job on a self-hosted runner.

Describe the bug

With the old self-hosted runners (before Scale Sets), Prometheus metrics such as github_workflow_job_failures_total carried the labels repository, repository_full_name, workflow_name, and job_name.

Example

github_workflow_job_failures_total{container="actions-metrics-server", endpoint="metrics-port", exit_code="1", failed_step="10", instance="10.26.153.77:8080", job="actions-runner-controller-actions-metrics-server", job_name="client-codeql / client-codeql", namespace="github-actions", owner="orgname", pod="actions-runner-controller-actions-metrics-server-598bb5fb99lhl2", repository="reponame", repository_full_name="orgname/reponame", runs_on="self-hosted", service="actions-runner-controller-actions-metrics-server", workflow_name="build"}

I found this very helpful because I could monitor for jobs that began to fail, or for repos where developers were running very long or costly CI.

Now, with the new metrics, we only have job_workflow_ref and job_name. job_workflow_ref only points to the org/repo/relative_path of the workflow file. With reusable workflows, this means it only shows where the reusable workflow lives (in our case a shared CI repo), not the developer's repo that called it.

Example:

gha_completed_jobs_total{container="autoscaler", endpoint="http-metrics", event_name="pull_request", instance="10.26.139.124:8080", job="self-hosted-listener", job_name=" / Build Test Publish", job_result="canceled", job_workflow_ref="myorg/github-actions-shared/.github/workflows/python-library.yaml@refs/heads/v1", namespace="github-actions", organization="myorg", pod="self-hosted-56cc585c-listener", service="self-hosted-listener"}
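To illustrate the problem, here is a minimal sketch that splits the job_workflow_ref value from the example above into its parts. It assumes the value always has the shape owner/repo/path@git_ref, as in that example; parse_job_workflow_ref is a hypothetical helper, not part of ARC.

```python
def parse_job_workflow_ref(ref: str) -> dict:
    """Split 'owner/repo/path/to/workflow.yaml@git_ref' into its parts.

    Assumes the shape seen in the example metric above; other shapes
    are not handled.
    """
    # Everything after the first '@' is the git ref of the workflow file.
    path_part, _, git_ref = ref.partition("@")
    # The first two path segments are owner and repo; the rest is the file path.
    owner, repo, workflow_path = path_part.split("/", 2)
    return {
        "owner": owner,
        "repository": repo,
        "workflow_path": workflow_path,
        "git_ref": git_ref,
    }


parts = parse_job_workflow_ref(
    "myorg/github-actions-shared/.github/workflows/python-library.yaml@refs/heads/v1"
)
print(parts["repository"])
```

The repository recovered this way is github-actions-shared, the shared CI repo, so nothing in the label identifies the repo that actually triggered the job.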

Describe the expected behavior

I expect the Prometheus metrics for the new Scale Set listener to include the labels workflow_name and repository in addition to job_workflow_ref, since those labels existed in the previous self-hosted runner setup.

I know that this will increase metric cardinality, but it did not seem to be a problem when we ran the old self-hosted runners.
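As a rough illustration of the cardinality trade-off, the sketch below counts distinct label combinations (one time series each) with and without the extra labels. The sample data is made up: repo-a, repo-b, and the workflow names are hypothetical, and count_series is an illustrative helper, not anything from ARC or Prometheus.

```python
def count_series(samples, label_names):
    """Count distinct combinations of the given labels across samples."""
    return len({tuple(s.get(name, "") for name in label_names) for s in samples})


# Hypothetical jobs that all call the same shared reusable workflow.
SHARED_REF = "myorg/github-actions-shared/.github/workflows/python-library.yaml@refs/heads/v1"
samples = [
    {"job_workflow_ref": SHARED_REF, "repository": "repo-a", "workflow_name": "build"},
    {"job_workflow_ref": SHARED_REF, "repository": "repo-b", "workflow_name": "build"},
    {"job_workflow_ref": SHARED_REF, "repository": "repo-b", "workflow_name": "release"},
]

current = count_series(samples, ["job_workflow_ref"])
proposed = count_series(samples, ["job_workflow_ref", "repository", "workflow_name"])
print(current, proposed)
```

With only job_workflow_ref, all three jobs collapse into a single series; adding repository and workflow_name yields one series per calling repo and workflow, which is the extra cardinality in question and also exactly the breakdown that is missing today.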

I know we have a fairly unusual case in that we rely heavily on the reusable workflows feature. Without these labels, though, it is almost impossible to track which repos are initiating the workflows and jobs.

Thanks in advance 🙏

Additional Context

Not Relevant

Controller Logs

Not Relevant

Runner Pod Logs

Not Relevant

jameshounshell avatar Dec 21 '23 20:12 jameshounshell

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Dec 21 '23 20:12 github-actions[bot]