ARC in k8s mode not working with ResourceQuotas. Jobs fail instead of queuing.
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
In Kubernetes mode with a container hook template, request more CPU, memory, storage, etc. for the workflow pod than the ResourceQuota allows. For example:
- Define a ResourceQuota with a hard limit on CPU/memory, e.g. 8 CPUs:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: arc-runners-quota
  namespace: arc-runners
spec:
  hard:
    requests.cpu: "8"
- Set up an ARC AutoscalingRunnerSet in Kubernetes mode with a container hook template:
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: self-hosted-k8s
  namespace: arc-runners
spec:
  ...
  template:
    spec:
      containers:
        - name: runner
          resources:
            requests:
              cpu: "0"
          env:
            ...
            - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
              value: /home/runner/workflow-pod-config/workflow-pod-config.yaml
            ...
          volumeMounts:
            ...
            - mountPath: /home/runner/workflow-pod-config
              name: workflow-pod-config
      volumes:
        ...
        - name: workflow-pod-config
          configMap:
            name: workflow-pod-config
            items:
              - key: workflow-pod-config.yaml
                path: workflow-pod-config.yaml
  ...
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-pod-config
  namespace: arc-runners
data:
  workflow-pod-config.yaml: |
    apiVersion: v1
    kind: PodTemplate
    metadata:
      labels:
        app: runner-pod-template
    spec:
      containers:
        - name: $job
          resources:
            requests:
              cpu: "5"
- Run a workflow with two jobs whose `runs-on` matches the AutoscalingRunnerSet name:
on:
  push:
    branches: [ main ]
jobs:
  job1:
    runs-on: self-hosted-k8s
    container:
      image: ubuntu:22.04
    steps:
      - run: sleep 60
  job2:
    runs-on: self-hosted-k8s
    container:
      image: ubuntu:22.04
    steps:
      - run: sleep 60
You can also reproduce this with a single job that exceeds the resource quota on its own, but the scenario above is the more likely one in practice.
Describe the bug
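In the example provided, one job runs successfully, but the other fails when its workflow pod is created: each workflow pod requests 5 CPUs, so two concurrent workflow pods need 10 CPUs against a quota of 8, and the quota is temporarily exceeded.
In general, jobs fail whenever the quota is temporarily exceeded instead of waiting for capacity to free up.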
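The failure follows from how quota admission accounts for CPU requests. The Python sketch below is a simplified model, not Kubernetes' actual admission code: `CpuQuota` and `parse_cpu` are hypothetical names, only plain and `m` (millicore) CPU quantities are handled, and only `requests.cpu` is tracked. When a namespace has a quota on compute requests, Kubernetes may reject pods that do not declare those requests, which is presumably why the runner container sets `cpu: "0"`.

```python
from dataclasses import dataclass


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as "5" or "500m" to millicores.

    Simplification: only the plain-number and "m" (milli) forms are handled.
    """
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


@dataclass
class CpuQuota:
    """Hypothetical model of a ResourceQuota with only `requests.cpu` set."""
    hard_millis: int
    used_millis: int = 0

    def try_admit(self, request: str) -> bool:
        """Admit a pod if its CPU request fits in the remaining quota."""
        req = parse_cpu(request)
        if self.used_millis + req > self.hard_millis:
            return False  # the API server would reject the pod creation
        self.used_millis += req
        return True

    def release(self, request: str) -> None:
        """Free the quota held by a pod that has finished."""
        self.used_millis -= parse_cpu(request)


quota = CpuQuota(hard_millis=parse_cpu("8"))
print(quota.try_admit("0"))  # runner pod for job1 (requests cpu "0")
print(quota.try_admit("0"))  # runner pod for job2
print(quota.try_admit("5"))  # workflow pod for job1
print(quota.try_admit("5"))  # workflow pod for job2: 5 + 5 > 8, rejected
quota.release("5")           # job1 finishes
print(quota.try_admit("5"))  # job2's workflow pod would now fit
```

This prints `True`, `True`, `True`, `False`, `True`: the rejection is only temporary, since the second workflow pod fits as soon as the first one finishes. A single workflow pod requesting more than 8 CPUs, by contrast, would never be admitted.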
Describe the expected behavior
In the example provided, the jobs should queue and run one at a time, completing one after the other and leading to a successful build.
In general, when the quota is temporarily exceeded, creating the workflow pod should be retried after a while, preferably through a queue. It is also unfortunate that the runner starts even when there is no room for the workflow pod, but I am not sure what could be done about that without significantly changing ARC's architecture.
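The retry behavior described above could look roughly like the following Python sketch. It is not ARC or container hook code: `create_with_retry`, `QuotaExceededError`, and the backoff parameters are hypothetical, and a real implementation would have to recognize the quota rejection from the Kubernetes API error response. It also separates out the single-job case, where the request alone exceeds the hard limit and waiting can never help.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QuotaExceededError(Exception):
    """Hypothetical error for a pod creation rejected by a ResourceQuota."""


def create_with_retry(
    create: Callable[[], T],
    request_millis: int,
    hard_millis: int,
    max_attempts: int = 10,
    base_delay: float = 5.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `create` until it succeeds, backing off while the quota is full."""
    if request_millis > hard_millis:
        # Permanent condition: this pod can never fit, so fail immediately.
        raise ValueError("request exceeds the quota's hard limit")
    delay = base_delay
    attempt = 1
    while True:
        try:
            return create()
        except QuotaExceededError:
            if attempt >= max_attempts:
                raise  # give up; the job fails as it does today
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
            attempt += 1


# Usage: a fake pod creation that is rejected twice before the quota frees up.
attempts = []


def fake_create() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise QuotaExceededError("quota full")
    return "workflow-pod"


delays = []
print(create_with_retry(fake_create, 5000, 8000, sleep=delays.append))
print(delays)  # the waits between attempts
```

Injecting `sleep` keeps the example fast and testable; here it prints `workflow-pod` followed by `[5.0, 10.0]`. A queue-based design, as suggested above, would avoid holding a started runner while waiting, but that is the larger architectural change the issue mentions.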
Additional Context
This can be considered a separate issue from https://github.com/actions/actions-runner-controller/issues/3629, as I believe the code needs to change in a different place and the behavior is slightly different. This is why I created two issues.
Controller Logs
https://gist.github.com/ropelli/2260d7303e4a09c75170105ee1afdaac
Runner Pod Logs
https://gist.github.com/ropelli/8d8ea405ad54128c24e42292a3aaeb09