ARC not working with ResourceQuotas. Fails to schedule pod instead of queuing.
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Request more CPU, memory, storage, etc. than the resource quota allows. For example:
- Define resource quota with hard limit for 8 cpus
apiVersion: v1
kind: ResourceQuota
metadata:
  name: arc-runners-quota
  namespace: arc-runners
spec:
  hard:
    requests.cpu: "8"
- Set up ARC with an AutoscalingRunnerSet whose runner container requests more than half of the quota's CPUs, for example 5:
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: self-hosted
  namespace: arc-runners
spec:
  ...
  template:
    spec:
      containers:
      - name: runner
        resources:
          requests:
            cpu: "5"
        ...
- Run a workflow with two jobs whose runs-on matches the AutoscalingRunnerSet name:
on:
  push:
    branches: [ main ]
jobs:
  job1:
    runs-on: self-hosted
    steps:
    - run: sleep 60
  job2:
    runs-on: self-hosted
    steps:
    - run: sleep 60
You can also reproduce this with a single job that exceeds the resource quota, but the scenario above is more likely.
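As a rough check of why only one of the two jobs can run, here is a minimal sketch of the arithmetic. It assumes the runner container is the only container with a CPU request and that nothing else in the namespace consumes the quota; the function name `maxConcurrentPods` is hypothetical and not part of ARC.

```go
package main

import "fmt"

// maxConcurrentPods returns how many pods with the given CPU request fit
// under a requests.cpu quota at the same time. Both values are millicores.
// The second result is false when the request is not positive, in which
// case this quota does not bound the pod count.
func maxConcurrentPods(quotaMilli, requestMilli int64) (int64, bool) {
	if requestMilli <= 0 {
		return 0, false
	}
	return quotaMilli / requestMilli, true
}

func main() {
	// Values from the reproduction: requests.cpu quota "8", runner request "5".
	n, ok := maxConcurrentPods(8000, 5000)
	fmt.Println(n, ok)
}
```

With a quota of 8 CPUs and 5 CPUs per runner, only one runner pod fits, so creating the second pod is rejected until the first one finishes.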
Describe the bug
In the example provided, one job runs and the other gets stuck waiting for a runner, because its EphemeralRunner ends up in the Failed state.
In general, jobs get stuck waiting for a runner that never appears until another job is scheduled for the same runner scale set.
Describe the expected behavior
In the example provided, the jobs should run one at a time, with the second queued until the first completes, leading to a successful build.
In general, when the quota is temporarily exceeded, ARC should try again after a while, preferably through a queue implementation.
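The "try again after a while" behavior could look roughly like the sketch below. It is not ARC's implementation: the function names, the base delay, and the cap are all hypothetical, and matching on the text "exceeded quota" is an assumption based on the error messages quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// isQuotaExceeded reports whether a pod-creation error message looks like a
// ResourceQuota rejection. Matching on the message text is an assumption.
func isQuotaExceeded(errMsg string) bool {
	return strings.Contains(errMsg, "exceeded quota")
}

// retryDelay returns a capped exponential backoff for a 0-based attempt.
// The base delay and the cap are hypothetical values.
func retryDelay(attempt int) time.Duration {
	const base = 5 * time.Second
	const maxDelay = 5 * time.Minute
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// decide treats quota rejections as transient (retry later) and every other
// creation error as terminal (no retry).
func decide(errMsg string, attempt int) (retry bool, after time.Duration) {
	if isQuotaExceeded(errMsg) {
		return true, retryDelay(attempt)
	}
	return false, 0
}

func main() {
	msg := `pods "runner-abc" is forbidden: exceeded quota: arc-runners-quota`
	for attempt := 0; attempt < 4; attempt++ {
		retry, after := decide(msg, attempt)
		fmt.Printf("attempt=%d retry=%v after=%v\n", attempt, retry, after)
	}
}
```

In a controller-runtime reconciler, a delay like this would typically be returned as `ctrl.Result{RequeueAfter: after}` instead of marking the EphemeralRunner as Failed. Whether that fits ARC's design is for the maintainers to decide.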
Additional Context
Previous issues where removing ResourceQuota helped:
https://github.com/actions/actions-runner-controller/issues/3211#issuecomment-1883410610
https://github.com/actions/actions-runner-controller/issues/3191#issuecomment-1883407473
Controller Logs
https://gist.github.com/ropelli/86ac726df685716b2e7e510a72e63139
Runner Pod Logs
No runner pod. No logs
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I created a separate issue for k8s mode and container hooks: https://github.com/actions/runner-container-hooks/issues/229. There, jobs fail if not enough quota is available at that moment.
The following change seems to "fix" at least the simple case with two jobs, but this is probably not the way to go and I would not recommend it:
diff --git a/controllers/actions.github.com/ephemeralrunner_controller.go b/controllers/actions.github.com/ephemeralrunner_controller.go
index 36ea114..be82878 100644
--- a/controllers/actions.github.com/ephemeralrunner_controller.go
+++ b/controllers/actions.github.com/ephemeralrunner_controller.go
@@ -21,6 +21,7 @@ import (
 	"errors"
 	"fmt"
 	"net/http"
+	"strings"
 	"time"
 
 	"github.com/actions/actions-runner-controller/apis/actions.github.com/v1alpha1"
@@ -216,6 +217,21 @@ func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Requ
 		case err == nil:
 			return result, nil
 		case kerrors.IsInvalid(err) || kerrors.IsForbidden(err):
+			if strings.Contains(err.Error(), "exceeded quota") {
+				log.Info("Failed to create a pod due to quota exceeded. Let's try again later")
+				log.Error(err, "Error: ")
+				err := r.Patch(ctx, ephemeralRunner, client.RawPatch(types.MergePatchType, []byte(`{"metadata":{"finalizers":[]}}`)))
+				if err != nil {
+					log.Error(err, "Error: ")
+					return ctrl.Result{}, err
+				}
+				err = r.Delete(ctx, ephemeralRunner)
+				if err != nil {
+					log.Error(err, "Error: ")
+					return ctrl.Result{}, err
+				}
+				return ctrl.Result{}, nil
+			}
 			log.Error(err, "Failed to create a pod due to unrecoverable failure")
 			errMessage := fmt.Sprintf("Failed to create the pod: %v", err)
 			if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, ReasonInvalidPodFailure, log); err != nil {
Thank you @ropelli for bringing this up. We are facing exactly the same issue: the EphemeralRunner ends up in the Failed state, and the RunnerSet does not spawn another runner if the number of runners, including those in the Failed state, equals the number of running workflows. Removing the failed EphemeralRunner helps in this situation: a new one is created and is able to pick up the work. Currently this requires manual action, or additional code to track and remove such failed EphemeralRunners, which is not ideal nor, I guess, a valid approach.
In my opinion, it would be good to keep all Failed EphemeralRunners visible for at least some time (they give context on what happened), while having the controller able to recover and spawn new ones when one of them fails to start.
Awaiting ARC team response :)
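The "additional code to track and remove such failed EphemeralRunners" mentioned above could follow the selection logic sketched here. The `runner` struct is a hypothetical, simplified stand-in for an EphemeralRunner, and the 24-hour retention is an assumed value. Actually deleting the selected objects (for example with `kubectl delete ephemeralrunner`) is left out.

```go
package main

import (
	"fmt"
	"time"
)

// runner is a hypothetical, simplified view of an EphemeralRunner.
type runner struct {
	Name   string
	Status string        // for example "Running" or "Failed"
	Age    time.Duration // time since the object was created
}

// retention is an assumed period during which Failed runners are kept
// around so that their failure message stays visible.
const retention = 24 * time.Hour

// staleFailed returns the names of Failed runners older than keep.
func staleFailed(runners []runner, keep time.Duration) []string {
	var names []string
	for _, r := range runners {
		if r.Status == "Failed" && r.Age > keep {
			names = append(names, r.Name)
		}
	}
	return names
}

// sampleRunners returns made-up data for the example.
func sampleRunners() []runner {
	return []runner{
		{Name: "runner-a", Status: "Failed", Age: 54 * time.Hour},
		{Name: "runner-b", Status: "Failed", Age: 6 * time.Hour},
		{Name: "runner-c", Status: "Running", Age: 72 * time.Hour},
	}
}

func main() {
	// Only runner-a is both Failed and older than the retention period.
	fmt.Println(staleFailed(sampleRunners(), retention))
}
```

This keeps recent failures visible for debugging while still freeing the slot eventually. It is a workaround, not a substitute for the controller recovering on its own.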
Any news on that?
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
What exactly is meant by "shortly" in this context?
We are seeing something similar here using version 0.10.1 of the autoscaling runner scale set and controller. We have a resource quota in place on the namespace where our runners run.
If scheduling an additional runner would breach the quota, the runner simply fails to schedule and is not queued. On the GitHub side, the job stays pending, waiting for a runner that will never come.
Additionally, at a glance, it appears that the "used limits" are stored somewhere by the controller and not queried again from the K8s API. I'm quite confident there were more than enough "limits" available when it tried to schedule the second runner.
kubectl describe quota -n gha-runner
Name:          hard-gha-runner
Namespace:     gha-runner
Resource       Used   Hard
--------       ----   ----
limits.cpu     150m   21
limits.memory  250Mi  21Gi
kubectl get ephemeralrunner -n gha-runner
NAME GITHUB CONFIG URL RUNNERID STATUS JOBREPOSITORY JOBWORKFLOWREF WORKFLOWRUNID JOBDISPLAYNAME MESSAGE AGE
k8s-dev01-cq46h-runner-9p6z2 https://github.com/someorg 163 Failed Failed to create the pod: pods "k8s-dev01-cq46h-runner-9p6z2" is forbidden: exceeded quota: hard-gha-runner requested: limits.cpu=8,limits.memory=8Gi, used: limits.cpu=17350m,limits.memory=21242Mi, limited: limits.cpu=21,limits.memory=21Gi 2d6h
k8s-dev01-cq46h-runner-bwk8h https://github.com/someorg 165 Failed Failed to create the pod: pods "k8s-dev01-cq46h-runner-bwk8h" is forbidden: exceeded quota: hard-gha-runner requested: limits.cpu=8,limits.memory=8Gi, used: limits.cpu=17350m,limits.memory=21242Mi, limited: limits.cpu=21,limits.memory=21Gi 6h18m
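The CPU figures in the MESSAGE column can be checked against the current quota output with a small sketch. One possible reading, stated as an assumption, is that the message is a snapshot of the API server's rejection from when pod creation failed (the runners are 2d6h and 6h18m old), while `kubectl describe quota` shows usage now. The helper names are hypothetical, and the parser only handles plain and `m`-suffixed CPU quantities.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMilliCPU converts a CPU quantity such as "8" or "17350m" to
// millicores. Other Kubernetes quantity formats are not handled here.
func parseMilliCPU(q string) (int64, error) {
	if strings.HasSuffix(q, "m") {
		return strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	}
	f, err := strconv.ParseFloat(q, 64)
	if err != nil {
		return 0, err
	}
	return int64(f*1000 + 0.5), nil
}

// fits reports whether requested + used stays within the hard limit.
func fits(requested, used, hard string) (bool, error) {
	var vals [3]int64
	for i, q := range []string{requested, used, hard} {
		v, err := parseMilliCPU(q)
		if err != nil {
			return false, fmt.Errorf("parse %q: %w", q, err)
		}
		vals[i] = v
	}
	return vals[0]+vals[1] <= vals[2], nil
}

func main() {
	// Figures recorded in the failure message: 8000m + 17350m > 21000m.
	atFailure, _ := fits("8", "17350m", "21")
	// Figures from the current quota output: 8000m + 150m <= 21000m.
	now, _ := fits("8", "150m", "21")
	fmt.Println(atFailure, now)
}
```

With the recorded figures the request does not fit, and with the current figures it does. That is consistent with the quota having been full when the pods were rejected and freed later, at which point nothing retried the Failed runners.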