Runners on EKS with an EFS volume in K8s-mode can't start a job pod.
Checks
- [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [x] I am using charts that are officially provided
Controller Version
0.10.1
Deployment Method
Helm
Checks
- [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Deploy a kubernetes-mode runner in EKS using CONTAINER_HOOKS, with EFS as a RWX storage volume
2. Run a job that only uses the runner container; everything works fine.
3. Run the same job but add the `container:` key to the workflow; the runner pod never gets past "Pending".
Describe the bug
First and foremost, has anyone successfully used EFS for the `_work` volume in kubernetes-mode runners? I can't find any examples, so maybe that approach is simply wrong. I don't know of any other readily available CSI driver for EKS that supports RWX, which I understand is required for kubernetes mode.
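For reference, this is (as far as I can tell) the chart's built-in kubernetes container mode that I commented out in my values below in favor of defining the work volume myself; the storage class here is just my EFS class, used for illustration:

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: "gh-efs-sc"   # illustration only; any RWX-capable storage class
    resources:
      requests:
        storage: 10Gi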
All runners, successful or not, show a few error events while waiting for the EFS volume to become available.
Warning FailedScheduling 33s default-scheduler 0/8 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "arc-amd-8jt4l-runner-kll9r-work". preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
I guess EFS is just slow, but I don't know why that would prevent the runner from starting at all.
Describe the expected behavior
I expected a runner with the `container:` key to create a job pod using that container.
Additional Context
My Runner definition:
githubConfigSecret: github-auth
githubConfigUrl: <url>
controllerServiceAccount:
  namespace: gh-controller
  name: github-arc
# containerMode:
#   kubernetesModeWorkVolumeClaim:
#     accessModes: ["ReadWriteOnce"]
template:
  spec:
    nodeSelector:
      beta.kubernetes.io/arch: amd64
    serviceAccountName: github-runner
    # securityContext:
    #   fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        #image: 823996030995.dkr.ecr.us-west-2.amazonaws.com/github-runner-robust:amd64
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/config/hook-extension.yaml
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: hook-extension
            mountPath: /home/runner/config/hook-extension.yaml
            subPath: hook-extension.yaml
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteMany"]
              storageClassName: "gh-efs-sc"
              resources:
                requests:
                  storage: 10Gi
      - name: hook-extension
        configMap:
          name: hook-extension
          items:
            - key: content
              path: hook-extension.yaml
The hook extension only adds a serviceAccountName to the worker pod:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      serviceAccountName: github-runner
The following job will work:
name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    #container:
    #  image: alpine:latest
    steps:
      - run: echo "hooray!"
However, if I uncomment `container:` and `image:`, the runner pod gets stuck at `Pending` and never even creates the job pod.
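For clarity, this is the variant that hangs (the same job with the `container:` lines uncommented):

name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    container:
      image: alpine:latest
    steps:
      - run: echo "hooray!"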
It's also worth noting the commented-out `fsGroup:` key: setting it previously got the runner to work, but after some CSI driver updates it became a problem.
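For reference, this is the pod security context that used to make the EFS mount work (the lines now commented out in the values above):

template:
  spec:
    securityContext:
      fsGroup: 1001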
Controller Logs
The controller logs just show this every minute or so while the pod is pending:
2025-01-14T20:52:05Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2025-01-14T20:52:05Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "current": 1, "desired": 1}
2025-01-14T20:52:05Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"version": "0.10.1", "autoscalingrunnerset": {"name":"arc-amd","namespace":"gh-runners"}, "name": "arc-amd-8jt4l", "specHash": "76b6bcbfbb"}
Runner Pod Logs
The runner pod never reaches a point where it can produce logs.
Same here, using an ephemeral EBS volume. Everything looks OK and there are no errors, but the jobs are not getting picked up.
I was able to work around this by creating an NFS server inside the cluster, then attaching a CSI driver to it so that I can share RWX volumes via a StorageClass. I won't call it a good solution considering how many extra moving parts it requires, but it has worked so far.
I would still like to know if there is a way that ARC is supposed to run on EKS.
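For anyone who wants to try the same workaround, the rough shape is an in-cluster NFS export plus the csi-driver-nfs provisioner (nfs.csi.k8s.io) pointed at it. This is only a sketch; the server address and share path are placeholders for whatever your NFS service actually exposes:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-rwx
provisioner: nfs.csi.k8s.io
reclaimPolicy: Delete
parameters:
  server: nfs-server.nfs.svc.cluster.local  # placeholder: in-cluster NFS service DNS name
  share: /exports                           # placeholder: path exported by the NFS server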
Issue: GitHub Actions Runner on EKS 1.31 with EFS StorageClass Fails on ls Step
I'm experiencing the same issues on EKS 1.31 when using GitHub Actions Runners with an EFS-backed StorageClass. The pipeline execution fails at the ls step due to a missing directory.
Configuration Details
StorageClass Definition
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
reclaimPolicy: Delete
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0ssssssss0xd1
  uid: "1001"
  gid: "123"
  directoryPerms: "775"
  basePath: "/github"
  subPath: "${.PVC.namespace}"
  reuseAccessPoint: "true"
Helm values for the runner installation (kubernetes mode)
listenerTemplate:
  spec:
    tolerations:
      - key: "runner"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - name: listener
        image: ghcr.io/actions/gha-runner-scale-set-controller:0.10.1
template:
  spec:
    securityContext:
      fsGroup: 123
    tolerations:
      - key: "default"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: [ "/home/runner/run.sh" ]
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: [ "ReadWriteMany" ]
              storageClassName: "efs-sc"
              resources:
                requests:
                  storage: 10Gi
GitHub workflow definition used for testing
name: GitHub Actions Demo
run-name: ${{ github.actor }} is testing out GitHub Actions 🚀
on: [push]
jobs:
  Explore-GitHub-Actions:
    runs-on: arc-runner-set
    container: ubuntu:latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
      - run: echo "🖥️ The workflow is now ready to test your code on the runner."
      - name: List files in the repository
        run: |
          ls ${{ github.workspace }}
      - run: echo "🍏 This job's status is ${{ job.status }}."
GitHub workflow test output (errors on the `ls` step)
Run ls /home/runner/_work/pipes-tests/pipes-tests
ls /home/runner/_work/pipes-tests/pipes-tests
shell: sh -e {0}
Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node20/bin/node {0}
ls: cannot access '/home/runner/_work/pipes-tests/pipes-tests': No such file or directory
Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/e8ed3990-efd8-11ef-bca6-b7b5281efb41.sh], exit code 2
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self-hosted runner administrator.
@qspors I'm not sure if this is the same issue or not. In my case, a job that uses the `container:` key never gets as far as running steps. The worker pod just never gets created because of the volume errors. It seems like yours can create the worker pod, but the pod has some other path (or permissions?) error. But I'm just guessing.
Closing this one since it is not related to ARC. ARC's responsibility is to spin up the runner according to the provided spec. This issue looks to be environment-specific.