DinD runner Design is not compatible with Kubernetes/Karpenter & Needs root
Checks
- [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I am using charts that are officially provided
Controller Version
0.9.2
Deployment Method
Helm
Checks
- [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
- [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
- Spin up DinD ARSS runner.
- Apply this PR patch on the runner. YOu can override the the file in Docker image using following step, And setup terminationGracePeriod & env variable
RUNNER_GRACEFUL_STOP_TIMEOUTCOPY --chown=runner:runner run.sh /home/runner/run.sh - Write a workflow with a following step and with above DinD runner.
- name: Run test
run: |
sleep 300
docker run hello-world
- Find the runner pod that's triggered for above, anddo kubectl terminate pod PODNAME -n NAMESPACE
- For Default/k8s runner, this will wait until the job is completed. But, for Dind It will crash, More on why in the next section
Describe the bug
This is a design bug of Scaled Set DinD runner. This Scaled Set DinD runner is clearly meant to run within the kubernetes, but it's not compatible with terminationGracePeriod & karpenter. Pod Movement is expected and application should respect SIGTERM.
While we can do fix the runner and make default/kuberentes type runners compatible with kubernetes/karpenter easily. DinD runner by it's design makes it harder to implement this. Basically, what happens when kubernetes sends a SIGTERM is:
- There are 2 containers running on the pod - First one is runner and Second one is the DinD Containers.
- k8s/karpenter sends SIGTERM ( When it decides to move the Pod ) to both the containers running on the pod.
- Runner will recieve the SIGTERM and decides to wait until the completion of the job due to RUNNER_GRACEFUL_STOP_TIMEOUT
- DinD container will receive the SIGTERM and will immediately exit the process.
A wrapper script that captures the SIGTERM can't properly fix it and need to think this through, see the next section ( additional Context) on how it's a design issue.
Describe the expected behavior
DinD container should wait until the Runner container is finished running. Capturing SIGTERM on DinD Container with wrapper script and waiting for completion of docker usage ( with run/build) will not work because there could be more workflow steps in the github action that need the DinD container.
Potential Solutions
Thoughts on this Design Bug: Because DinD container should wait for Runner Container, following are some of the approaches.
- On the SIGTERM Wrapper for DinD script polls for the runner status using Github API.
- Create a volume and Mount the same volume on both the container.
-- Use FileSystem as IPC mechanism, and watch that file in the DinD's SIGTERM trap. -- UseshareProcessNamespaceand DinD's lifecycle preStop to watch the Runner process. - For Kubernetes 1.29+ , there is a new feature called Native sidecar containers, DinD can be moved there which keeps it for the lifetime of the main container runner
- Combine DinD and runner into one container, I have explained it in the next section.
Combine DinD and runner & run it as rootless:
We can install Rootless Docker or daemonless PodMan into the runner itself and use that. There is also another problem of ScaledSet runs DinD as root user , So It's better to look into rootless Docker or Podman as well . Following are the benefits of this approach.
- Solves the multi container orchestration problem with SIGTERM.
- rootless docker-in-docker containers
- Amortized kubernetes resources ( limits/resources) as opposed to dedicated runner/dind container. For most workloads, the resources are used at point of time for either runner or docker. For example, When you do git checkout on the runner, it uses runner resources but docker sits idle. But, when you do
docker run, dind container is busy but runner sits idle.
Controller Logs
NA
Runner Pod Logs
NA
This PR is a decent workaround for this issue, but I think we can do better than that.
Interesting...
@jaswanthikolla I like the suggestion of combining dind and the runner into a single image. Were you able to make any progress towards that?
🤔 +1
https://github.com/actions/actions-runner-controller/pull/3842 - moving it as a Sidecar would also solve the issue.