
Runners on EKS with an EFS volume in K8s-mode can't start a job pod.

sierrasoleil opened this issue 1 year ago · 5 comments

Checks

  • [x] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [x] I am using charts that are officially provided

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • [x] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [x] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy a kubernetes-mode runner in EKS using CONTAINER_HOOKS, with EFS as a RWX storage volume
2. Run a job that only uses the runner container; everything works fine.
3. Run the same job but add the `container:` key to the workflow; the runner pod never gets past "Pending".

Describe the bug

First and foremost, has anyone successfully used EFS for the _work volume in kubernetes-mode runners? I can't find any examples, so maybe that approach is just wrong. I also don't know of any other readily available CSI driver for EKS that supports RWX, which I gather is required for k8s-mode.
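
For reference, the gha-runner-scale-set chart also exposes a containerMode block that is supposed to wire this up. A minimal sketch of the documented shape, with my EFS StorageClass substituted in as a placeholder:

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    # "gh-efs-sc" is a placeholder for whichever RWX-capable StorageClass exists.
    accessModes: ["ReadWriteMany"]
    storageClassName: "gh-efs-sc"
    resources:
      requests:
        storage: 10Gi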

All runners, successful or not, show a few error events while waiting for the EFS volume to become available.

Warning  FailedScheduling  33s   default-scheduler  0/8 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "arc-amd-8jt4l-runner-kll9r-work". preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

I guess EFS is just slow, but I don't know why that would prevent the runner from starting at all.

Describe the expected behavior

I expected a runner with the `container:` key to create a job pod using that container.

Additional Context

My Runner definition:

githubConfigSecret: github-auth
githubConfigUrl: <url>

controllerServiceAccount:
  namespace: gh-controller
  name: github-arc

# containerMode:
#   kubernetesModeWorkVolumeClaim:
#     accessModes: ["ReadWriteOnce"]

template:
  spec:
    nodeSelector:
      beta.kubernetes.io/arch: amd64
    serviceAccountName: github-runner
    # securityContext:
    #   fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        #image: 823996030995.dkr.ecr.us-west-2.amazonaws.com/github-runner-robust:amd64
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/config/hook-extension.yaml
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: hook-extension
            mountPath: /home/runner/config/hook-extension.yaml
            subPath: hook-extension.yaml
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteMany"]
              storageClassName: "gh-efs-sc"
              resources:
                requests:
                  storage: 10Gi
      - name: hook-extension
        configMap:
          name: hook-extension
          items:
            - key: content
              path: hook-extension.yaml


The hook extension only adds a serviceAccountName to the worker pod:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      serviceAccountName: github-runner
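
Since the hook template is merged into the job pod's spec, other PodSpec fields can ride along the same way. A sketch (the nodeSelector is purely illustrative, mirroring the runner's own selector):

apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      serviceAccountName: github-runner
      # Illustrative addition: schedule the job pod on the same arch as the runner.
      nodeSelector:
        beta.kubernetes.io/arch: amd64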


The following job will work:

name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    #container:
    #  image: alpine:latest
    steps:
      - run: echo "hooray!"


However, if I uncomment `container:` and `image:`, the runner pod gets stuck at `Pending` and never even creates the job pod.
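
For clarity, the failing variant is just:

name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    container:
      image: alpine:latest
    steps:
      - run: echo "hooray!"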

It's also worth noting the commented-out `fsGroup:` key: setting it previously got the runner working, but after some CSI driver updates it became a problem.

Controller Logs

The controller logs just show this every minute or so while the pod is pending:

2025-01-14T20:52:05Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2025-01-14T20:52:05Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "current": 1, "desired": 1}
2025-01-14T20:52:05Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.10.1", "autoscalingrunnerset": {"name":"arc-amd","namespace":"gh-runners"}, "name": "arc-amd-8jt4l", "specHash": "76b6bcbfbb"}

Runner Pod Logs

The runner pod never reaches a point where it can produce logs.

— sierrasoleil, Jan 14 '25


Same here, using an ephemeral EBS volume. Everything looks OK and there are no errors, but the jobs are not getting picked up.

— alexsorkin, Jan 30 '25

I was able to work around this by creating an NFS server inside the cluster, then attaching a CSI driver to it so that I can share RWX volumes via a StorageClass. I won't call it a good solution considering how many extra moving parts it requires, but it has worked so far.
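
The shape of that workaround is roughly the StorageClass below, assuming the kubernetes-csi/csi-driver-nfs driver is installed and the in-cluster NFS server is exposed as a Service (the server address and export path are hypothetical):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-rwx
provisioner: nfs.csi.k8s.io
parameters:
  # Hypothetical in-cluster NFS Service and export path.
  server: nfs-server.nfs-system.svc.cluster.local
  share: /exports
reclaimPolicy: Delete
volumeBindingMode: Immediate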

I would still like to know if there is a way that ARC is supposed to run on EKS.

— sierrasoleil, Jan 30 '25

Issue: GitHub Actions Runner on EKS 1.31 with EFS StorageClass Fails on ls Step

I'm experiencing the same issues on EKS 1.31 when using GitHub Actions Runners with an EFS-backed StorageClass. The pipeline execution fails at the ls step due to a missing directory.


Configuration Details

StorageClass Definition

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
reclaimPolicy: Delete
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0ssssssss0xd1
  uid: "1001"
  gid: "123"
  directoryPerms: "775"
  basePath: "/github"
  subPath: "${.PVC.namespace}"
  reuseAccessPoint: "true"
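
Worth double-checking: the aws-efs-csi-driver documents the per-PVC directory parameter as subPathPattern rather than subPath, i.e.:

  # As documented by aws-efs-csi-driver (verify against the installed driver version):
  subPathPattern: "${.PVC.namespace}"

If the driver ignores the unrecognized subPath key, the access point may not land where the runner expects.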

Helm values for the runner installation (kubernetes mode)

listenerTemplate:
  spec:
    tolerations:
      - key: "runner"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - name: listener
        image: ghcr.io/actions/gha-runner-scale-set-controller:0.10.1
template:
  spec:
    securityContext:
      fsGroup: 123
    tolerations:
      - key: "default"
        operator: "Exists"
        effect: "NoSchedule"
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: [ "/home/runner/run.sh" ]
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: [ "ReadWriteMany" ]
              storageClassName: "efs-sc"
              resources:
                requests:
                  storage: 10Gi

GitHub workflow definition for tests

name: GitHub Actions Demo
run-name: ${{ github.actor }} is testing out GitHub Actions 🚀
on: [push]
jobs:
  Explore-GitHub-Actions:
    runs-on: arc-runner-set
    container: ubuntu:latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
      - run: echo "🖥️ The workflow is now ready to test your code on the runner."
      - name: List files in the repository
        run: |
          ls ${{ github.workspace }}
      - run: echo "🍏 This job's status is ${{ job.status }}."

GitHub workflow test output; fails on the ls step

Run ls /home/runner/_work/pipes-tests/pipes-tests
  ls /home/runner/_work/pipes-tests/pipes-tests
  shell: sh -e {0}
Run '/home/runner/k8s/index.js'
  shell: /home/runner/externals/node20/bin/node {0}
  
ls: cannot access '/home/runner/_work/pipes-tests/pipes-tests': No such file or directory
Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/e8ed3990-efd8-11ef-bca6-b7b5281efb41.sh], exit code 2
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self-hosted runner administrator.


— yuriyyurov, Feb 20 '25

@qspors I'm not sure if this is the same issue or not. In my case, a job that uses the container: key never gets as far as running steps. The worker pod just never gets created because of the volume errors. It seems like yours can create the worker pod, but the pod has some other path (or permissions?) error. But I'm just guessing.

— sierrasoleil, Feb 21 '25

Closing this one since it is not related to ARC. ARC's responsibility is to spin up the runner according to the provided spec. This issue looks to be environment-specific.

— nikola-jokic, Aug 6 '25