Occasionally jobs are stuck in "waiting for a runner to come online"
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
No clear reproduction steps for the bug, as it is occasional.
The team has noticed it occurs most often in the morning.
Describe the bug
In the example CI pipeline:
- At 06:49, a job was supposed to be scheduled on the gm2.dind.large-b4j7w runner after the job "get-ecr-login-password"
- For 40 minutes the job "unit-tests" was stuck waiting for a runner
- At 07:40:43 the controller logged "Created new ephemeral runner" for a runner named gm2.dind.large-b4j7w-runner-bsk5p. That runner was then used to run the job "unit-tests"
We have multiple cases of jobs waiting a long time for a runner, without any clear explanation.
Describe the expected behavior
- A job is triggered
- The ARC controller creates a runner
- The runner picks up the job and starts running it
Additional Context
There have been attempts to tune the 'minWorkers' configuration of the runner set to prevent the waiting issue (see the note after the values file below).
Runner sets are deployed with Helm using the following values.yaml:
githubConfigUrl: ${github_config_url}
githubConfigSecret:
  github_app_id: "${github_app_id}"
  github_app_installation_id: "${github_app_installation_id}"
  github_app_private_key: |
    ${github_app_private_key}
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "github-arc"
    resources:
      requests:
        storage: 10Gi
minWorkers: 2
listenerTemplate:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "8080"
  spec:
    containers:
      - name: listener
        securityContext:
          runAsUser: 1000
        resources:
          limits:
            memory: 64Mi
          requests:
            cpu: 100m
            memory: 64Mi
    nodeSelector:
      "karpenter.sh/nodepool": runner
    tolerations:
      - key: "karpenter.sh/nodepool"
        value: "runner"
        operator: "Equal"
        effect: "NoSchedule"
template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/path: "/metrics"
      prometheus.io/port: "8080"
  spec:
    securityContext:
      fsGroup: 1001
    serviceAccount: "${sa_name}"
    containers:
      - name: runner
        image: REDACTED/github-arc-runner:0.7
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
        resources:
          limits:
            cpu: "${resources.cpu}"
            memory: "${resources.memory}"
          requests:
            cpu: "${resources.cpu}"
            memory: "${resources.memory}"
    nodeSelector:
      "karpenter.sh/nodepool": runner
    tolerations:
      - key: "karpenter.sh/nodepool"
        value: "runner"
        operator: "Equal"
        effect: "NoSchedule"
Controller Logs
Logs would be best read when filtered on "gm2.dind.large-b4j7w"
"https://gist.github.com/victor-smg/1c7ed78e9702dd2389e98725c9ac37f5.js"
Runner Pod Logs
No relevant logs for the runner, as the lack of a runner is precisely the issue.
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
We are now facing this issue as well, same version.
A bit more info that may help.
So far we thought it was mostly a problem in the morning, but we are impacted by it right now as well. In theory there was only a GitHub Copilot outage in the last hour, nothing about Actions or the API, but one of our runners is still NOT fetching jobs. It seems the autoscalingrunnerset is not doing its job correctly.
krzyzakp@X1Carbon:/home/krzyzakp $ k describe -n github-arc autoscalingrunnersets.actions.github.com gm2.small
Name:         gm2.small
Namespace:    github-arc
Labels:       actions.github.com/organization=tutti-ch
              actions.github.com/scale-set-name=gm2.small
              actions.github.com/scale-set-namespace=github-arc
              app.kubernetes.io/component=autoscaling-runner-set
              app.kubernetes.io/instance=gm2.small
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=gm2.small
              app.kubernetes.io/part-of=gha-rs
              app.kubernetes.io/version=0.9.3
              helm.sh/chart=gha-rs-0.9.3
Annotations:  actions.github.com/cleanup-github-secret-name: gm2.small-gha-rs-github-secret
              actions.github.com/cleanup-kubernetes-mode-role-binding-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-role-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-kubernetes-mode-service-account-name: gm2.small-gha-rs-kube-mode
              actions.github.com/cleanup-manager-role-binding: gm2.small-gha-rs-manager
              actions.github.com/cleanup-manager-role-name: gm2.small-gha-rs-manager
              actions.github.com/runner-group-name: Default
              actions.github.com/runner-scale-set-name: gm2.small
              actions.github.com/values-hash: <REDACTED in comment>
              meta.helm.sh/release-name: gm2.small
              meta.helm.sh/release-namespace: github-arc
              runner-scale-set-id: 2
API Version:  actions.github.com/v1alpha1
Kind:         AutoscalingRunnerSet
Metadata:
  Creation Timestamp:  2024-05-22T12:57:11Z
  Finalizers:
    autoscalingrunnerset.actions.github.com/finalizer
  Generation:        32
  Resource Version:  326465633
  UID:               b6f45bbd-9c25-475c-b633-ebd4290365e3
Spec:
  Github Config Secret:  gm2.small-gha-rs-github-secret
  Github Config URL:     https://github.com/tutti-ch
  Listener Template:
    Metadata:
      Annotations:
        prometheus.io/path:    /metrics
        prometheus.io/port:    8080
        prometheus.io/scrape:  true
    Spec:
      Containers:
        Name:  listener
        Resources:
          Limits:
            Memory:  64Mi
          Requests:
            Cpu:     100m
            Memory:  64Mi
        Security Context:
          Run As User:  1000
      Node Selector:
        karpenter.sh/nodepool:  runner
      Tolerations:
        Effect:    NoSchedule
        Key:       karpenter.sh/nodepool
        Operator:  Equal
        Value:     runner
  Template:
    Metadata:
      Annotations:
        prometheus.io/path:    /metrics
        prometheus.io/port:    8080
        prometheus.io/scrape:  true
    Spec:
      Containers:
        Command:
          /home/runner/run.sh
        Env:
          Name:   ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          Value:  false
          Name:   ACTIONS_RUNNER_CONTAINER_HOOKS
          Value:  /home/runner/k8s/index.js
          Name:   ACTIONS_RUNNER_POD_NAME
          Value From:
            Field Ref:
              Field Path:  metadata.name
        Image:  668273420038.dkr.ecr.eu-central-1.amazonaws.com/github-arc-runner:0.7
        Name:   runner
        Resources:
          Limits:
            Cpu:     500m
            Memory:  1Gi
          Requests:
            Cpu:     500m
            Memory:  1Gi
        Volume Mounts:
          Mount Path:  /home/runner/_work
          Name:        work
      Node Selector:
        karpenter.sh/nodepool:  runner
      Restart Policy:           Never
      Security Context:
        Fs Group:            1001
      Service Account:       gm.small
      Service Account Name:  gm2.small-gha-rs-kube-mode
      Tolerations:
        Effect:    NoSchedule
        Key:       karpenter.sh/nodepool
        Operator:  Equal
        Value:     runner
      Volumes:
        Ephemeral:
          Volume Claim Template:
            Spec:
              Access Modes:
                ReadWriteOnce
              Resources:
                Requests:
                  Storage:         10Gi
              Storage Class Name:  github-arc
        Name:  work
Status:
  Current Runners:            23
  Failed Ephemeral Runners:   22
  Pending Ephemeral Runners:  1
  Running Ephemeral Runners:  0
Events:  <none>
You can clearly see there are 23 runners, 22 of which have failed, but even if I restart the job and try to get hold of a pod, I don't see any. What is more surprising, other runner sets using exactly the same config (except mem/cpu limits) are working fine, even though they are bigger. So I don't think it is a lack of node capacity.
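For anyone digging into the same state, the failed runners can be inspected directly through the CRDs; the namespace matches the describe output above, and the runner name is a placeholder:

kubectl get ephemeralrunners -n github-arc
kubectl describe ephemeralrunner <name-of-a-failed-runner> -n github-arc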
On top of that, in the logs I see the following (read from bottom to top):
2024-07-11 17:20:19.366
2024-07-11T15:20:19Z INFO EphemeralRunner Deleting the ephemeral runner pod {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "podId": "61019e6a-6221-4d3d-8fa3-689cad9a2396"}
2024-07-11 17:20:19.366
2024-07-11T15:20:19Z INFO EphemeralRunner Ephemeral runner pod has finished, but the runner still exists in the service. Deleting the pod to restart it. {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:19.366
2024-07-11T15:20:19Z INFO EphemeralRunner Runner exists in GitHub service {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerId": 604044}
2024-07-11 17:20:19.271
2024-07-11T15:20:19Z INFO EphemeralRunner Checking if runner exists in GitHub service {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerId": 604044}
2024-07-11 17:20:15.287
2024-07-11T15:20:15Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"version": "0.9.3", "autoscalingrunnerset": {"name":"gm2.small","namespace":"github-arc"}, "name": "gm2.small-zh6h4", "specHash": "75f975979"}
2024-07-11 17:20:15.285
2024-07-11T15:20:15Z INFO EphemeralRunner Ephemeral runner container is still running {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:15.284
2024-07-11T15:20:15Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11 17:20:15.284
2024-07-11T15:20:15Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 0, "running": 1, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11 17:20:15.272
2024-07-11T15:20:15Z INFO EphemeralRunnerSet Updating status with current runners count {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "count": 23}
2024-07-11 17:20:15.272
2024-07-11T15:20:15Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11 17:20:15.272
2024-07-11T15:20:15Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 0, "running": 1, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11 17:20:15.271
2024-07-11T15:20:15Z INFO EphemeralRunner Updated ephemeral runner status with pod phase {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:15.257
2024-07-11T15:20:15Z INFO EphemeralRunner Updating ephemeral runner status with pod phase {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "statusPhase": "Running", "statusReason": "", "statusMessage": ""}
2024-07-11 17:20:15.257
2024-07-11T15:20:15Z INFO EphemeralRunner Ephemeral runner container is still running {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.099
2024-07-11T15:20:09Z INFO EphemeralRunner Ephemeral runner container is still running {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.096
2024-07-11T15:20:09Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "current": 23, "desired": 23}
2024-07-11 17:20:09.096
2024-07-11T15:20:09Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"gm2.small-zh6h4","namespace":"github-arc"}, "pending": 1, "running": 0, "finished": 0, "failed": 22, "deleting": 0}
2024-07-11 17:20:09.096
2024-07-11T15:20:09Z INFO EphemeralRunner Updated ephemeral runner status with pod phase {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.089
2024-07-11T15:20:09Z INFO EphemeralRunner Updating ephemeral runner status with pod phase {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "statusPhase": "Pending", "statusReason": "", "statusMessage": ""}
2024-07-11 17:20:09.089
2024-07-11T15:20:09Z INFO EphemeralRunner Ephemeral runner container is still running {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:09.070
2024-07-11T15:20:09Z INFO EphemeralRunner Waiting for runner container status to be available {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.156
2024-07-11T15:20:02Z INFO EphemeralRunner Waiting for runner container status to be available {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.153
2024-07-11T15:20:02Z INFO EphemeralRunner Waiting for runner container status to be available {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.145
2024-07-11T15:20:02Z INFO EphemeralRunner Created ephemeral runner pod {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}, "runnerScaleSetId": 2, "runnerName": "gm2.small-zh6h4-runner-v9sch", "runnerId": 604044, "configUrl": "https://github.com/tutti-ch", "podName": "gm2.small-zh6h4-runner-v9sch"}
2024-07-11 17:20:02.116
2024-07-11T15:20:02Z INFO EphemeralRunner Created new pod spec for ephemeral runner {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.116
2024-07-11T15:20:02Z INFO EphemeralRunner Creating new pod for ephemeral runner {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
2024-07-11 17:20:02.116
2024-07-11T15:20:02Z INFO EphemeralRunner Creating new EphemeralRunner pod. {"version": "0.9.3", "ephemeralrunner": {"name":"gm2.small-zh6h4-runner-v9sch","namespace":"github-arc"}}
It is super strange that the pod was running for just a few seconds and then got deleted.
Is there a workaround for this? The issue effectively causes a very poor experience.
Experiencing this issue as well with a new deployment on AKS. The runner job is never deployed. The only custom thing is that there is a nodeSelector.
controller logs
2024-07-22T09:51:19Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:51:19Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:51:19Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
2024-07-22T09:52:09Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:52:09Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:52:09Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
2024-07-22T09:52:59Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "pending": 0, "running": 0, "finished": 0, "failed": 1, "deleting": 0}
2024-07-22T09:52:59Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.9.3", "ephemeralrunnerset": {"name":"arc-runner-itplat-jdf4g","namespace":"arc-runner-itplat"}, "current": 1, "desired": 1}
2024-07-22T09:52:59Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"version": "0.9.3", "autoscalingrunnerset": {"name":"arc-runner-itplat","namespace":"arc-runner-itplat"}, "name": "arc-runner-itplat-jdf4g", "specHash": "84bc4c7fd5"}
listener logs
2024-07-22T09:51:19Z INFO listener-app.worker.kubernetesworker Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":0,\"replicas\":1}}"}
2024-07-22T09:51:19Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:51:19Z INFO listener-app.listener Getting next message {"lastMessageID": 0}
2024-07-22T09:52:09Z INFO listener-app.worker.kubernetesworker Calculated target runner count {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:52:09Z INFO listener-app.worker.kubernetesworker Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:52:09Z INFO listener-app.worker.kubernetesworker Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-07-22T09:52:09Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:52:09Z INFO listener-app.listener Getting next message {"lastMessageID": 0}
2024-07-22T09:52:59Z INFO listener-app.worker.kubernetesworker Calculated target runner count {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:52:59Z INFO listener-app.worker.kubernetesworker Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":2,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:52:59Z INFO listener-app.worker.kubernetesworker Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":2,\"replicas\":1}}"}
2024-07-22T09:52:59Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:52:59Z INFO listener-app.listener Getting next message {"lastMessageID": 0}
2024-07-22T09:53:49Z INFO listener-app.worker.kubernetesworker Calculated target runner count {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:53:49Z INFO listener-app.worker.kubernetesworker Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":3,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:53:49Z INFO listener-app.worker.kubernetesworker Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":3,\"replicas\":1}}"}
2024-07-22T09:53:49Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:53:49Z INFO listener-app.listener Getting next message {"lastMessageID": 0}
2024-07-22T09:54:40Z INFO listener-app.worker.kubernetesworker Calculated target runner count {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:54:40Z INFO listener-app.worker.kubernetesworker Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":4,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:54:40Z INFO listener-app.worker.kubernetesworker Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":4,\"replicas\":1}}"}
2024-07-22T09:54:40Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:54:40Z INFO listener-app.listener Getting next message {"lastMessageID": 0}
2024-07-22T09:55:30Z INFO listener-app.worker.kubernetesworker Calculated target runner count {"assigned job": 0, "decision": 1, "min": 0, "max": 2147483647, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-07-22T09:55:30Z INFO listener-app.worker.kubernetesworker Compare {"original": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":-1,\"patchID\":-1,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}", "patch": "{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"replicas\":1,\"patchID\":5,\"ephemeralRunnerSpec\":{\"metadata\":{\"creationTimestamp\":null},\"spec\":{\"containers\":null}}},\"status\":{\"currentReplicas\":0,\"pendingEphemeralRunners\":0,\"runningEphemeralRunners\":0,\"failedEphemeralRunners\":0}}"}
2024-07-22T09:55:30Z INFO listener-app.worker.kubernetesworker Preparing EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":5,\"replicas\":1}}"}
2024-07-22T09:55:30Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runner-itplat", "name": "arc-runner-itplat-jdf4g", "replicas": 1}
2024-07-22T09:55:30Z INFO listener-app.listener Getting next message {"lastMessageID": 0}
Totally blocked.
We've seen the same issue. There are other issues where this was reported against older versions of actions-runner-controller, and one of them has a comment saying the problem is resolved by restarting the listener.
This doesn't fix the root cause, but as a triage measure you can deploy a CronJob along these lines:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: actions-listener-restart-cron
  namespace: actions
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: your-service-account
          containers:
            - name: actions-listener-restart-pod
              image: bitnami/kubectl:v1.26.14
              imagePullPolicy: Always
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
                runAsUser: 999
              command:
                - /bin/sh
                - -c
                - kubectl delete pod -n actions -l app.kubernetes.io/component=runner-scale-set-listener
          restartPolicy: Never
You'll also need to define the ServiceAccount and its Role/RoleBinding (a minimal sketch follows below), but the above should get you started.
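For reference, a minimal sketch of the ServiceAccount and RBAC that the CronJob above assumes; the names and the actions namespace are placeholders and should match your setup:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: your-service-account
  namespace: actions
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: actions-listener-restart
  namespace: actions
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: actions-listener-restart
  namespace: actions
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: actions-listener-restart
subjects:
  - kind: ServiceAccount
    name: your-service-account
    namespace: actions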
On our side, it somehow magically stopped being an issue over 2 weeks ago. Since then, not a single job has gone unpicked (except for about an hour when we messed up our networking in AWS and the Kubernetes cluster was unreachable, but that was purely a network problem we caused ourselves). So for now I can't really debug it further or add the recommended restart, as it simply works.
Hello, we have a similar situation with our runners. We have basically the same template applied to all the different labels; the only difference between them is the resources set. We noticed that even deleting all the resources and reinstalling everything results in the same behaviour. Here are the logs of a working listener triggering new pods:
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Processing message","messageId":10391,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":1,"totalAcquiredJobs":0,"totalAssignedJobs":0,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Job available message received","jobId":7848922}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Acquiring jobs","count":1,"requestIds":"[7848922]"}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Jobs are acquired","count":1,"requestIds":"[7848922]"}
{"severity":"info","ts":"2024-09-11T15:14:22Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":10391}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.worker.kubernetesworker","message":"Created merge patch json for EphemeralRunnerSet update","json":"{\"spec\":{\"replicas\":null}}"}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.worker.kubernetesworker","message":"Scaling ephemeral runner set","assigned job":0,"decision":0,"min":0,"max":20,"currentRunnerCount":-1}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"NAMESPACE","name":"RUNNER-NAME","replicas":0}
{"severity":"info","ts":"2024-09-11T15:14:23Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10391}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"Processing message","messageId":10392,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":0,"totalRegisteredRunners":0,"totalBusyRunners":0,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"Job assigned message received","jobId":7848922}
{"severity":"info","ts":"2024-09-11T15:14:29Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":10392}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.worker.kubernetesworker","message":"Created merge patch json for EphemeralRunnerSet update","json":"{\"spec\":{\"replicas\":1}}"}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.worker.kubernetesworker","message":"Scaling ephemeral runner set","assigned job":1,"decision":1,"min":0,"max":20,"currentRunnerCount":-1}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"NAMESPACE","name":"RUNNER-NAME","replicas":1}
{"severity":"info","ts":"2024-09-11T15:14:30Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10392}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"Processing message","messageId":10393,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":0,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":1,"totalRegisteredRunners":1,"totalBusyRunners":1,"totalIdleRunners":0}}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"Job started message received.","RequestId":7848922,"RunnerId":8678330}
{"severity":"info","ts":"2024-09-11T15:14:44Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Updating job info for the runner","runnerName":"RUNNER-NAME-runner-x6xq4","ownerName":"ORGANIZATION","repoName":"REPOSITORY","workflowRef":"ORGANIZATION/REPOSITORY/.github/workflows/workflow.yml@refs/heads/main","workflowRunId":9332229,"jobDisplayName":"JOBNAME","requestId":7848922}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Updating ephemeral runner with merge patch","json":"{\"status\":{\"jobDisplayName\":\"JOB NAME\",\"jobRepositoryName\":\"ORGANIZATION/REPOSITORY\",\"jobRequestId\":7848922,\"jobWorkflowRef\":\"ORGANIZATION/REPOSITORY/.github/workflows/workflow.yml@refs/heads/main\",\"workflowRunId\":9332229}}"}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner status updated with the merge patch successfully."}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Created merge patch json for EphemeralRunnerSet update","json":"{\"spec\":{\"replicas\":1}}"}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Scaling ephemeral runner set","assigned job":1,"decision":1,"min":0,"max":20,"currentRunnerCount":-1}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.worker.kubernetesworker","message":"Ephemeral runner set scaled.","namespace":"NAMESPACE","name":"RUNNER-NAME","replicas":1}
{"severity":"info","ts":"2024-09-11T15:14:46Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:15:36Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:16:26Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
{"severity":"info","ts":"2024-09-11T15:17:16Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":10393}
As you can see, there are 3 entries of "Created merge patch json for EphemeralRunnerSet update": the first with replicas:null and the other two patching the number of replicas into the ephemeralrunnerset.
The only difference with a non-working job is that the second scale-up never happens. After the first patch with replicas:null, the listener keeps waiting for new events but never scales up any pods.
We are using runner scale set v0.8.3 with the ghcr.io/actions/actions-runner:2.316.1 image.
I even tried manually patching the ephemeralrunnerset, with no success (a rough sketch of the patch is below), and even when increasing the minimum number of runners from 0 to 2 on the autoscalingrunnerset, the pods appear but are not assigned any jobs.
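For context, the manual patch was roughly along these lines; the ephemeralrunnerset name and namespace are placeholders:

kubectl patch ephemeralrunnerset <ephemeral-runner-set-name> -n <namespace> \
  --type merge -p '{"spec":{"replicas":1}}'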
Any idea where we should look? It seems there is an issue between the listener pod and the ephemeralrunnerset, but as mentioned, even after deleting everything and recreating it the same behaviour happens.
Thanks
Hello everyone. In our case we created a separate runner group on GHE to register the listeners. For some reason, even after deleting the entire deployment, the listeners could not fetch any pending jobs until we deleted this runner group and recreated it. After that, all the existing labels registered successfully and the jobs started right away. Thanks
Just an update: we kept seeing the listeners go offline on our GitHub instance for no apparent reason. Deleting the autoscalingrunnerset caused the listener to be recreated and register again.
As soon as we do this, the jobs are picked up and new runners are created.
We are also checking on our GitHub instance side what might be wrong, since we do not see any errors on the listener/controller side. For the moment we've implemented a cronjob that deletes the autoscalingrunnerset every hour, checking first whether there are any active runners for that label. If there are, the autoscalingrunnerset is not deleted; if no runners are present, we delete it so it can register again (a rough sketch of the check follows below).
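A rough sketch of that hourly check, assuming the ephemeral runners carry the scale-set-name label and that the autoscalingrunnerset gets re-created by the deployment tooling (Helm/ArgoCD) after deletion; names and namespace are placeholders:

#!/bin/sh
# Only delete the AutoscalingRunnerSet when no ephemeral runner
# of this scale set is currently in the Running phase.
NS="my-runner-namespace"   # placeholder
SET="my-scale-set"         # placeholder
RUNNING=$(kubectl get ephemeralrunners -n "$NS" \
  -l actions.github.com/scale-set-name="$SET" \
  -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}')
if [ -z "$RUNNING" ]; then
  kubectl delete autoscalingrunnerset "$SET" -n "$NS"
fi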
I have the same issue, and @ricardojdsilva87's comment solves it. The listener does not report any error logs. The last thing it says (repeatedly) before dying suddenly is:
Preparing EphemeralRunnerSet update
I'm wondering if this has something to do with it, but it shouldn't matter if no jobs are running. The controller should be able to "see" that the listener is missing, kill the set, and recreate it. So perhaps if a job is not reporting as exited, the controller hangs around waiting to delete the ephemeral runner set (which also controls whether the listener exists or not)
https://github.com/actions/actions-runner-controller/blob/8b36ea90ebe81710fcdcb4f96424b43203d24f1e/charts/gha-runner-scale-set-controller/values.yaml#L123
Referencing related issue: https://github.com/actions/actions-runner-controller/issues/3704
Additional details if it helps:
- running with argocd
- controller and runners are in separate NS
- minimum runners is set to 2, so two runner pods seem to exist, but the listener pod has died
We have experienced the same issue here: we have 2 runner scale sets for redundancy, both with the same runner scale set name, each in a different runner group. It used to work, but now workflows hang with "waiting for a runner to come online", even though both runner scale sets show as online with a green light. We could not see any errors in the arc-system namespace, and there seems to be no way to determine the cause. For troubleshooting we uninstalled both Helm releases, ARC and the runner scale set, and reinstalled them, but the problem persists.
Any tips will be much appreciated!
Hey everyone, this should be resolved now. Can you please confirm you are no longer getting this error?
I'll close this issue here since we are tracking it in the new issue #3953. Thank you for reporting it, and please let us know if you are still experiencing the problem.
Hello @nikola-jokic, is this fix only for 0.11.0?