Unlimited number of runner pods (revisited)
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.27.4
Helm Chart Version
0.23.3
CertManager Version
1.11.0
Deployment Method
Helm
cert-manager installation
Yes, cert-manager is installed via the official Helm chart. cert-manager seems unrelated to the issue reported here.
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical enough that you need priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)
Resource Definitions
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  labels:
    variant: prod
  name: bby-ubuntu-autoscaler
  namespace: gha-runners
spec:
  maxReplicas: 200
  minReplicas: 10
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    kind: RunnerDeployment
    name: bby-ubuntu
  scaleUpTriggers:
  - duration: 30m
    githubEvent:
      workflowJob: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  labels:
    variant: prod
  name: bby-ubuntu
  namespace: gha-runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "900"
        name: runner
      dnsPolicy: ClusterFirst
      dockerEnabled: false
      dockerdWithinRunnerContainer: false
      env: null
      ephemeral: true
      group: atat
      hostAliases:
      - hostnames:
        - registry.yarnpkg.com
        - httpredir.debian.org
        - download.ceph.com
        - ppa.launchpadcontent.net
        - get.docker.com
        - download.docker.com
        - apt.postgresql.org
        - eol-repositories.sensuapp.org
        - us.archive.ubuntu.com
        - ppa.launchpad.net
        - security.ubuntu.com
        ip: 0.0.0.0
      # [...truncated... we're blocking upstreams that we mirror with Artifactory]
      image: our.custom.runner.from.summerwind.base/image:latest
      imagePullPolicy: IfNotPresent
      imagePullSecrets:
      - name: image-pull
      labels:
      - bby-ubuntu
      - bby-ubuntu-aws
      nodeSelector:
        eks.amazonaws.com/nodegroup: build
      organization: our-org
      resources:
        limits:
          cpu: "2"
          memory: 6Gi
        requests:
          cpu: 500m
          memory: 1Gi
      terminationGracePeriodSeconds: 960
      tolerations:
      - effect: NoExecute
        key: type
        operator: Equal
        value: build-infra
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      volumeMounts:
      - mountPath: /opt/hostedtoolcache
        name: opt-hostedtoolcache
      volumes:
      - emptyDir: {}
        name: opt-hostedtoolcache
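As a side note on the HRA above: a quick way to verify whether the controller is honoring `spec.maxReplicas` is to compare it against the replica count the controller currently wants. This is only a sketch, and it assumes `status.desiredReplicas` is populated by this CRD version; adjust the field path if yours differs.
# compare the configured ceiling with what the controller currently requests
❯ kubectl get horizontalrunnerautoscaler bby-ubuntu-autoscaler -n gha-runners \
    -o jsonpath='max={.spec.maxReplicas} desired={.status.desiredReplicas}{"\n"}'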
To Reproduce
This will not be easy to reproduce...
1. Set up 2 completely separate instances of ARC (call them ARC-A and ARC-B), with separate runner labels and groups, and separate GH app credentials, under one GitHub org. Both use ephemeral runners.
1.a. ARC-A does not use autoscaling; it simply polls for jobs with a fixed number of runners available.
1.b. ARC-B uses webhook scaling.
2. Do something to prevent ARC-A from cleaning up (deregistering) its finished runners via the GH API, until runner group A shows as completely full (10,000 runners in the group, with 99% of them in "offline" status; see the command sketch after this list).
3. Observe ARC-B begin scaling runners, ignoring the `maxReplicas` defined on its HRA.
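For step 2's observation, a rough sketch of how the saturation can be confirmed from the shell (the org name is a placeholder, and this assumes a token or GitHub App with permission to list org self-hosted runners):
# count org runners reported as "offline"; per_page maxes out at 100, so a saturated
# org means roughly 100 paginated requests for 10,000 registrations
❯ gh api --paginate "orgs/OUR-ORG/actions/runners?per_page=100" \
    --jq '.runners[] | select(.status == "offline") | .name' | wc -l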
Describe the bug
Note the "steps to reproduce" section, re: multiple instances of ARC for the one GitHub organization. ARC-A controls the "legacy" runners in our data center, nearing retirement; ARC-B is running in AWS EKS as the current standard. The issue reported here is strictly about the behavior of ARC-B as an outcome of a separate issue that happened to ARC-A.
This seems related to https://github.com/actions/actions-runner-controller/issues/1646. It was the only issue I could find with the same symptoms. In particular, this comment was enlightening.
We were alerted by our Grafana monitoring based on a spike in pod count. After reading the issue mentioned above, I checked how many pods carry the actions-runner/id annotation:
# count of pods in gha-runners namespace
❯ kubectl get pods -n gha-runners | grep -c ".*"
514 # <-- far above normal (usually averaging 30-60 total)
# count of runner pods having the id annotation
❯ kubectl get pods -n gha-runners -o yaml | yq '.items[].metadata.annotations | has("actions-runner/id")' | grep -c 'true'
2 # <-- far too low for number of pods
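For completeness, a variation of the same check (jq instead of yq) that lists the names of the runner pods still missing the annotation, rather than just counting them:
# list runner pods that have not yet been annotated with their GitHub runner ID
❯ kubectl get pods -n gha-runners -o json | jq -r '.items[] | select(.metadata.annotations["actions-runner/id"] == null) | .metadata.name'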
It seems that, due to the saturation of ARC-A's runner group, and because the "list runners" endpoint GET /orgs/{org}/actions/runners returns ALL runners in the org with no option to filter by runner group, ARC-B could not paginate that endpoint quickly enough to find all of its managed runners. It therefore could not resolve the GitHub integer ID for each of its runners, could not annotate the pods with that ID, and thus could not clean them up either, even though the runners themselves were in fact registered properly with GitHub.
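To illustrate why the saturated org listing hurts: anything that needs the integer runner ID has to page through the entire org-wide list and filter client-side. A rough shell equivalent of that lookup (the org name and runner name prefix are assumptions based on our setup):
# page through every org runner and keep only this deployment's runners by name prefix;
# with ~10,000 stale ARC-A registrations that is ~100 API requests before any ID is known
❯ gh api --paginate "orgs/OUR-ORG/actions/runners?per_page=100" \
    --jq '.runners[] | select(.name | startswith("bby-ubuntu")) | "\(.id)\t\(.name)\t\(.status)"'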
ARC-B controller was logging many errors like below:
2023-12-01T16:27:23Z INFO runnerreplicaset Runner failed to register itself to GitHub in timely manner. Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. See https://github.com/actions/actions-runner-controller/issues/288 {"runnerreplicaset": "gha-runners/bby-ubuntu-4lt5f", "owner": "gha-runners/bby-ubuntu-4lt5f-xfjwv", "creationTimestamp": "2023-12-01 16:16:04 +0000 UTC", "readyTransitionTime": "2023-12-01 16:16:05 +0000 UTC", "configuredRegistrationTimeout": "10m0s"}
Describe the expected behavior
Other than the shared GitHub org and using the same runner image, these 2 ARC instances have nothing in common. I would have expected separate instances managing separate runner groups to be unaffected by problems in the other.
Whole Controller Logs
The logs from this incident are older than what is available via kubectl now, so I exported a relevant timeframe from Grafana instead:
https://gist.github.com/nimjor/fff0ac4cbca94f0e19358aa367c45814
Whole Runner Pod Logs
Example logs from a runner that was successfully registered and completed someone's workflow job, even though the ARC controller thought it was not properly registered:
https://gist.github.com/nimjor/81fe3b73ebad771a6f7147d4542e0e46
Additional Context
There are hundreds of entries just like this in the controller logs for the example runner above:
2023-12-01T13:43:25Z INFO runnerreplicaset Runner failed to register itself to GitHub in timely manner. Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. See https://github.com/actions/actions-runner-controller/issues/288 {"runnerreplicaset": "gha-runners/bby-ubuntu-4lt5f", "owner": "gha-runners/bby-ubuntu-4lt5f-47h9d", "creationTimestamp": "2023-12-01 10:46:56 +0000 UTC", "readyTransitionTime": "2023-12-01 10:46:57 +0000 UTC", "configuredRegistrationTimeout": "10m0s"}
@mumoshu we're also observing a similar error. Would really appreciate any insight from your end.