Unlimited number of runner pods (revisited)
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.27.4
Helm Chart Version
0.23.3
CertManager Version
1.11.0
Deployment Method
Helm
cert-manager installation
Yes, cert-manager is installed via the official Helm chart. cert-manager seems unrelated to the issue reported here.
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical enough that you need priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)
Resource Definitions
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  labels:
    variant: prod
  name: bby-ubuntu-autoscaler
  namespace: gha-runners
spec:
  maxReplicas: 200
  minReplicas: 10
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleTargetRef:
    kind: RunnerDeployment
    name: bby-ubuntu
  scaleUpTriggers:
  - duration: 30m
    githubEvent:
      workflowJob: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  labels:
    variant: prod
  name: bby-ubuntu
  namespace: gha-runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "900"
        name: runner
      dnsPolicy: ClusterFirst
      dockerEnabled: false
      dockerdWithinRunnerContainer: false
      env: null
      ephemeral: true
      group: atat
      hostAliases:
      - hostnames:
        - registry.yarnpkg.com
        - httpredir.debian.org
        - download.ceph.com
        - ppa.launchpadcontent.net
        - get.docker.com
        - download.docker.com
        - apt.postgresql.org
        - eol-repositories.sensuapp.org
        - us.archive.ubuntu.com
        - ppa.launchpad.net
        - security.ubuntu.com
        ip: 0.0.0.0
      # [...truncated... we're blocking upstreams that we mirror with Artifactory]
      image: our.custom.runner.from.summerwind.base/image:latest
      imagePullPolicy: IfNotPresent
      imagePullSecrets:
      - name: image-pull
      labels:
      - bby-ubuntu
      - bby-ubuntu-aws
      nodeSelector:
        eks.amazonaws.com/nodegroup: build
      organization: our-org
      resources:
        limits:
          cpu: "2"
          memory: 6Gi
        requests:
          cpu: 500m
          memory: 1Gi
      terminationGracePeriodSeconds: 960
      tolerations:
      - effect: NoExecute
        key: type
        operator: Equal
        value: build-infra
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 10
      volumeMounts:
      - mountPath: /opt/hostedtoolcache
        name: opt-hostedtoolcache
      volumes:
      - emptyDir: {}
        name: opt-hostedtoolcache
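As a side note on the HRA above: a quick way to verify whether the controller is honoring `spec.maxReplicas` is to compare it against the replica count the controller currently wants. This is only a sketch, and it assumes `status.desiredReplicas` is populated by this CRD version; adjust the field path if yours differs.
# compare the configured ceiling with what the controller currently requests
❯ kubectl get horizontalrunnerautoscaler bby-ubuntu-autoscaler -n gha-runners \
    -o jsonpath='max={.spec.maxReplicas} desired={.status.desiredReplicas}{"\n"}'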
To Reproduce
This will not be easy to reproduce...
1. Set up 2 completely separate instances of ARC (call them ARC-A and ARC-B), with separate runner labels and groups, and separate GH app credentials, under one GitHub org. Both use ephemeral runners.
1.a. ARC-A does not use autoscaling; it simply polls for jobs with a fixed number of runners available.
1.b. ARC-B uses webhook scaling.
2. Do something to prevent ARC-A from cleaning up (deregistering) its finished runners via the GH API, until runner group A shows as completely full (10,000 runners in the group, with 99% of them in "offline" status; see the command sketch after this list).
3. Observe ARC-B begin scaling runners, ignoring the `maxReplicas` defined on its HRA.
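For step 2's observation, a rough sketch of how the saturation can be confirmed from the shell (the org name is a placeholder, and this assumes a token or GitHub App with permission to list org self-hosted runners):
# count org runners reported as "offline"; per_page maxes out at 100, so a saturated
# org means roughly 100 paginated requests for 10,000 registrations
❯ gh api --paginate "orgs/OUR-ORG/actions/runners?per_page=100" \
    --jq '.runners[] | select(.status == "offline") | .name' | wc -l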
Describe the bug
Note the "steps to reproduce" section, re: multiple instances of ARC for the one GitHub organization. ARC-A controls the "legacy" runners in our data center, nearing retirement; ARC-B is running in AWS EKS as the current standard. The issue reported here is strictly about the behavior of ARC-B as an outcome of a separate issue that happened to ARC-A.
This seems related to https://github.com/actions/actions-runner-controller/issues/1646. It was the only issue I could find with the same symptoms. In particular, this comment was enlightening.
We were alerted by our Grafana monitoring based on a spike in pod count. After reading the issue mentioned above, I checked how many pods carry the actions-runner/id annotation:
# count of pods in gha-runners namespace
❯ kubectl get pods -n gha-runners | grep -c ".*"
514 # <-- far above normal (usually averaging 30-60 total)
# count of runner pods having the id annotation
❯ kubectl get pods -n gha-runners -o yaml | yq '.items[].metadata.annotations | has("actions-runner/id")' | grep -c 'true'
2 # <-- far too low for number of pods
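For completeness, a variation of the same check (jq instead of yq) that lists the names of the runner pods still missing the annotation, rather than just counting them:
# list runner pods that have not yet been annotated with their GitHub runner ID
❯ kubectl get pods -n gha-runners -o json | jq -r '.items[] | select(.metadata.annotations["actions-runner/id"] == null) | .metadata.name'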
It seems that, due to the saturation of ARC-A's runner group, and because the "list runners" endpoint GET /orgs/{org}/actions/runners returns ALL runners in the org with no option to filter by runner group, ARC-B could not paginate that endpoint quickly enough to find all of its managed runners. It therefore could not resolve the GitHub integer ID for each of its runners, could not annotate the pods with that ID, and thus could not clean them up either, even though the runners themselves were in fact registered properly with GitHub.
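To illustrate why the saturated org listing hurts: anything that needs the integer runner ID has to page through the entire org-wide list and filter client-side. A rough shell equivalent of that lookup (the org name and runner name prefix are assumptions based on our setup):
# page through every org runner and keep only this deployment's runners by name prefix;
# with ~10,000 stale ARC-A registrations that is ~100 API requests before any ID is known
❯ gh api --paginate "orgs/OUR-ORG/actions/runners?per_page=100" \
    --jq '.runners[] | select(.name | startswith("bby-ubuntu")) | "\(.id)\t\(.name)\t\(.status)"'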
ARC-B controller was logging many errors like below:
2023-12-01T16:27:23Z INFO runnerreplicaset Runner failed to register itself to GitHub in timely manner. Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. See https://github.com/actions/actions-runner-controller/issues/288 {"runnerreplicaset": "gha-runners/bby-ubuntu-4lt5f", "owner": "gha-runners/bby-ubuntu-4lt5f-xfjwv", "creationTimestamp": "2023-12-01 16:16:04 +0000 UTC", "readyTransitionTime": "2023-12-01 16:16:05 +0000 UTC", "configuredRegistrationTimeout": "10m0s"}
Describe the expected behavior
Other than the shared GitHub org and using the same runner image, these 2 ARC instances have nothing in common. I would have expected separate instances managing separate runner groups to be unaffected by problems in the other.
Whole Controller Logs
The logs from this incident are older than what is available via kubectl now, so I exported a relevant timeframe from Grafana instead:
https://gist.github.com/nimjor/fff0ac4cbca94f0e19358aa367c45814
Whole Runner Pod Logs
Example logs from a runner that was successfully registered and completed someone's workflow job, even though the ARC controller thought it was not properly registered:
https://gist.github.com/nimjor/81fe3b73ebad771a6f7147d4542e0e46
Additional Context
There are hundreds of entries just like this in the controller logs for the example runner above:
2023-12-01T13:43:25Z INFO runnerreplicaset Runner failed to register itself to GitHub in timely manner. Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. See https://github.com/actions/actions-runner-controller/issues/288 {"runnerreplicaset": "gha-runners/bby-ubuntu-4lt5f", "owner": "gha-runners/bby-ubuntu-4lt5f-47h9d", "creationTimestamp": "2023-12-01 10:46:56 +0000 UTC", "readyTransitionTime": "2023-12-01 10:46:57 +0000 UTC", "configuredRegistrationTimeout": "10m0s"}
@mumoshu we're also observing a similar error. Would really appreciate any insight from your end.