actions-runner-controller DinD runner Design is not compatible with Kubernetes/Karpenter & Needs root

Checks

[X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
[X] I am using charts that are officially provided

Controller Version

0.9.2

Deployment Method

Helm

Checks

[X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
[X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Spin up DinD ARSS runner.
Apply this PR patch on the runner. You can override the file in the Docker image with the following step, and set terminationGracePeriod and the env variable RUNNER_GRACEFUL_STOP_TIMEOUT:
COPY --chown=runner:runner run.sh /home/runner/run.sh
Write a workflow with the following step that runs on the DinD runner above.

- name: Run test
  run: |
    sleep 300
    docker run hello-world

Find the runner pod that was started for the workflow above, and run kubectl delete pod PODNAME -n NAMESPACE
For the default/k8s runner, the pod waits until the job is completed. For DinD, the job crashes. More on why in the next section.

Describe the bug

This is a design bug of the Scale Set DinD runner. The Scale Set DinD runner is clearly meant to run within Kubernetes, but it is not compatible with terminationGracePeriod and Karpenter. Pod movement is expected, and the application should respect SIGTERM.

While we can easily fix the runner and make the default/kubernetes type runners compatible with Kubernetes/Karpenter, the DinD runner by its design makes this harder to implement. Basically, this is what happens when Kubernetes sends a SIGTERM:

There are 2 containers running in the pod: the first is the runner and the second is the DinD container.
k8s/Karpenter sends SIGTERM (when it decides to move the pod) to both containers running in the pod.
The runner receives the SIGTERM and, because of RUNNER_GRACEFUL_STOP_TIMEOUT, waits for the job to complete.
The DinD container receives the SIGTERM and exits immediately, so any later docker step in the job fails.

A wrapper script that only captures the SIGTERM can't properly fix this, so it needs to be thought through. See the next section (Additional Context) on why it's a design issue.

Describe the expected behavior

The DinD container should wait until the runner container has finished running. Capturing SIGTERM in the DinD container with a wrapper script and waiting for the current docker usage (run/build) to complete will not work, because there could be more workflow steps in the GitHub Actions job that need the DinD container.

Potential Solutions

Thoughts on this design bug: because the DinD container should wait for the runner container, the following are some possible approaches.

On SIGTERM, a wrapper script for DinD polls for the runner status using the GitHub API.
Create a volume and mount the same volume in both containers.
-- Use the file system as an IPC mechanism, and watch a file on that volume in DinD's SIGTERM trap.
-- Use shareProcessNamespace and DinD's lifecycle preStop hook to watch the runner process.
For Kubernetes 1.29+, there is a new feature called native sidecar containers. DinD can be moved there, which keeps it alive for the lifetime of the main runner container.
Combine DinD and the runner into one container. I have explained this in the next section.
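
To make the shared-volume idea concrete, here is a rough sketch of a SIGTERM wrapper for the DinD container. It is only an illustration under assumptions: the sentinel path, the SENTINEL and GRACE_SECONDS variables, and both function names are hypothetical and not part of ARC, and it assumes the runner side creates the sentinel file on the shared volume once it has finished its job and is exiting.

```shell
#!/bin/sh
# Sketch only: SENTINEL and GRACE_SECONDS are hypothetical names, not ARC settings.
SENTINEL="${SENTINEL:-/shared/runner-done}"   # file the runner would create when it exits
GRACE_SECONDS="${GRACE_SECONDS:-300}"         # keep this below the pod's termination grace period

# Poll until the file at $1 exists or $2 seconds have passed.
# Returns 0 if the file appeared, 1 on timeout.
wait_for_sentinel() {
  path=$1
  remaining=$2
  while [ "$remaining" -gt 0 ]; do
    if [ -e "$path" ]; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  [ -e "$path" ]
}

# Start "$@" in the background and hold back SIGTERM from it until the
# sentinel shows up (or the grace budget runs out).
run_with_deferred_term() {
  "$@" &
  child=$!
  trap 'wait_for_sentinel "$SENTINEL" "$GRACE_SECONDS"; kill -TERM "$child" 2>/dev/null' TERM
  # wait returns early when the trap fires, so loop until the child is gone.
  while kill -0 "$child" 2>/dev/null; do
    wait "$child"
  done
}
```

As the DinD container's entrypoint, this would be invoked as something like run_with_deferred_term dockerd (or whatever command the DinD image normally starts). The important part is that the sentinel signals "the runner is done", not "one docker command is done", since later workflow steps may still need the daemon.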

Combine DinD and runner & run it as rootless:

We can install rootless Docker or daemonless Podman into the runner itself and use that. There is also another problem: the Scale Set runs DinD as the root user, so it's better to look into rootless Docker or Podman as well. The benefits of this approach are as follows.

Solves the multi-container orchestration problem with SIGTERM.
Rootless docker-in-docker containers.
Amortized Kubernetes resources (limits/resources), as opposed to dedicated runner/DinD containers. For most workloads, the resources are used at any given point in time by either the runner or Docker, not both. For example, when you do git checkout on the runner, it uses runner resources but Docker sits idle. When you do docker run, the DinD container is busy but the runner sits idle.

Controller Logs

NA

Runner Pod Logs

NA

Jun 14 '24 17:06 jaswanthikolla

This PR is a decent workaround for this issue, but I think we can do better than that.

Jun 16 '24 15:06 jaswanthikolla

Interesting...

Jun 24 '24 10:06 gfrid

@jaswanthikolla I like the suggestion of combining dind and the runner into a single image. Were you able to make any progress towards that?

Sep 10 '24 15:09 shankarRaman

🤔 +1

Nov 11 '24 13:11 thomaschaplin

https://github.com/actions/actions-runner-controller/pull/3842 - moving it as a Sidecar would also solve the issue.

Dec 12 '24 10:12 velkovb