
Docker socket sporadically not available in DinD mode


Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

I found no way to reproduce this issue

Describe the bug

In some workflow runs, the Docker socket at unix:///run/docker/docker.sock is not available inside the runner.

Any docker command then fails with:

Cannot connect to the Docker daemon at unix:///run/docker/docker.sock. Is the docker daemon running?

This happens in less than 1% of my runs, and re-running the job has always worked so far.

I found no root cause for this behaviour.
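
Not sure whether it helps narrow things down, but a quick spot check against a failing pod can at least show whether the socket and the dind container are there at that moment (the pod name below is just a placeholder):

    # Does the socket exist inside the runner container?
    kubectl exec <runner-pod> -c runner -- ls -l /run/docker/docker.sock

    # Are both the runner and dind containers reported as ready?
    kubectl get pod <runner-pod> \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'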

Describe the expected behavior

The Docker socket should be available under all circumstances.

Additional Context

  runnerScaleSetName: "ubuntu-general-dind"
  minRunners: 25
  maxRunners: 120
  template:
    spec:
      containers:
        - name: runner
          image: our-registry/actions-runner:SHA
          command: ["/home/runner/run.sh"]
          resources:
            limits:
              memory: "8Gi"
            requests:
              cpu: "1000m"
              memory: "2Gi"
      nodeSelector:
        class: gha-runners
      tolerations:
        - key: "gha-runners"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"

Controller Logs

could not pull relevant logs due to inability to reproduce

Runner Pod Logs

could not pull relevant logs due to inability to reproduce

norman-zon avatar Jan 04 '24 09:01 norman-zon

Hey @norman-zon,

This issue is likely caused by the runner starting up before the docker sidecar container. Could you please try adding something similar to this to your entrypoint and let us know if the issue persists?
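
What that amounts to is waiting for the docker daemon before starting the runner. A rough sketch only (not the exact script; the timeout value and paths are illustrative):

    #!/bin/bash
    # Wait for the dind sidecar's docker daemon before starting the runner.
    timeout=120
    until docker info >/dev/null 2>&1; do
      if [ "$timeout" -le 0 ]; then
        echo "Docker was not ready in time" >&2
        exit 1
      fi
      echo "Waiting for docker to be ready..."
      sleep 1
      timeout=$((timeout - 1))
    done

    # Hand over to the normal runner entrypoint.
    exec /home/runner/run.sh "$@"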

nikola-jokic avatar Jan 05 '24 14:01 nikola-jokic

@nikola-jokic thank you for the suggestion. I will give it a try and report back.

But isn't this something that should be part of the entrypoint of the default image?

norman-zon avatar Jan 09 '24 13:01 norman-zon

Thanks @norman-zon!

But isn't this something that should be part of the entrypoint of the default image?

I think that would be a good idea, to be honest. I will bring this up to the team and let you know :relaxed:

nikola-jokic avatar Jan 23 '24 11:01 nikola-jokic

Just found out that this is already part of the new image and that the wait is set to 120 seconds by default when using the Helm chart.

But somehow it still does not help; I just don't understand why.
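
For completeness, this is the wait I'm talking about; it is controlled via an env var on the runner container. A sketch based on our config above (the value shown is just the default mentioned):

    template:
      spec:
        containers:
          - name: runner
            image: our-registry/actions-runner:SHA
            command: ["/home/runner/run.sh"]
            env:
              # How long the entrypoint waits for the dind socket before giving up.
              # The chart reportedly defaults this to 120.
              - name: RUNNER_WAIT_FOR_DOCKER_IN_SECONDS
                value: "120"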

norman-zon avatar Feb 01 '24 12:02 norman-zon

Hey @norman-zon,

How did you determine that the entrypoint does not check for the docker socket? I set containerMode.type to dind, spun it up, and used kubectl logs to see the runner log; at the very top it says Waiting for docker to be ready. Can you please help me better understand the issue?
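
For reference, I used roughly these values for the runner scale set Helm chart (names are illustrative):

    githubConfigUrl: "https://github.com/my-org/my-repo"
    githubConfigSecret: "pre-defined-secret"
    containerMode:
      type: "dind"

and then checked the runner container log with kubectl logs <runner-pod> -c runner, which is where the Waiting for docker to be ready. line shows up.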

nikola-jokic avatar Feb 20 '24 13:02 nikola-jokic

I'm sorry, it is in my logs too; I just didn't look properly 😞

But since RUNNER_WAIT_FOR_DOCKER_IN_SECONDS is set by default in the Helm chart (since 2023/02), I wonder what caused the problems I described in my first post (which apparently affect others, judging by the thumbs-up count).

Is there anything I can do to help narrow it down? Or maybe someone else who encounters this can speak up with more details?

norman-zon avatar Feb 20 '24 14:02 norman-zon

No worries! I'm failing to reproduce the issue, so any input is welcome! If you can spot some kind of pattern in when this happens, or a particular action that fails with this problem, that would be very helpful.

nikola-jokic avatar Feb 20 '24 14:02 nikola-jokic

If you are now running k8s 1.29, I would recommend switching the dind container to a "sidecar" by changing it to be started as an initContainer with restartPolicy: Always:

      initContainers:
      - name: dind
        restartPolicy: Always
        ...

This will ensure that dind is running before builds are started. Kubernetes will also restart the dind container if needed during the lifetime of the pod.

But when the runner receives a stop signal, we need to handle that as well. We have therefore also added preStop hooks to both the runner and dind containers:

For the runner we have:

          preStop:
            exec:
              command:
                - "/bin/sh"
                - "-c"
                - "echo running > /home/runner/_work/.runner-state; while pgrep Runner.Worker; do echo 'worker process found, sleeping' >/proc/1/fd/1 2>&1; sleep 3; done; rm /home/runner
/_work/.runner-state; echo 'Done, removed /home/runner/_work/.runner-state (or never started)' >/proc/1/fd/1 2>&1"

And for dind container we have:

          preStop:
            exec:
              command:
                - "/bin/sh"
                - "-c"
                - "sleep 5; while grep -q running /home/runner/_work/.runner-state; do echo 'main container has work, sleeping...' >/proc/1/fd/1 2>&1; sleep 3; done; echo 'Did not find running runner. Stopping' >/proc/1/fd/1 2>&1"

Hope this helps :)

larhauga avatar Feb 23 '24 06:02 larhauga

Hey @norman-zon,

Can you please let me know if this is still an issue now that the docker socket path has changed? I don't know if it would influence anything, but I'm curious :relaxed:
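
If anyone wants to check which socket a given runner image is pointing at, something like this from inside the runner container should show it (the pod name is a placeholder):

    kubectl exec <runner-pod> -c runner -- sh -c 'echo "$DOCKER_HOST"; docker context ls'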

nikola-jokic avatar Mar 28 '24 16:03 nikola-jokic

Closing this one since there is a workaround and no activity on it. Please let us know if the new socket path fixed the issue. :relaxed: We can always re-open it.

nikola-jokic avatar Apr 18 '24 12:04 nikola-jokic