
docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/761bd05e8ceb95e1459db860b160e9dda095254a969ebd9a0b777524f73f9263/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

wjimenez5271 opened this issue 5 years ago • 4 comments

When following the latest instructions on https://github.com/NVIDIA/nvidia-docker/ to set up NVIDIA support for Docker, it says that nvidia-docker2 has been deprecated and that one should install the NVIDIA Container Toolkit instead. I followed the instructions for Ubuntu 18.04 with Docker 19.03; however, this does not seem to install the nvidia-container-runtime binary mentioned in this project's README. As a result, Docker cannot start any container after updating the default runtime in /etc/docker/daemon.json per the README. Is this device plugin not compatible with the latest iteration of the NVIDIA container stack? Here is the error message:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/761bd05e8ceb95e1459db860b160e9dda095254a969ebd9a0b777524f73f9263/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

and just to confirm that the binary is missing:

ls /usr/bin/nvidia-container-runtime
ls: cannot access '/usr/bin/nvidia-container-runtime': No such file or directory

I also tried nvidia-container-cli, since that is what the current package installs. Is it possible this repo needs to be updated to reflect nvidia-docker2's deprecation?
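
For anyone comparing setups, here is a quick way to check which of the NVIDIA container binaries and packages actually landed on the host (a minimal sketch, assuming an apt-based distro like Ubuntu 18.04; it only inspects, it doesn't change anything):

# which of the NVIDIA runtime binaries are on $PATH (missing ones print nothing)
command -v nvidia-container-runtime nvidia-container-cli nvidia-container-toolkit
# which NVIDIA container packages are installed
dpkg -l | grep -E 'nvidia-(docker|container)|libnvidia-container'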

wjimenez5271 avatar May 05 '20 01:05 wjimenez5271

The docs in this repo specifically state that nvidia-container-toolkit should not be used and that nvidia-docker2 should be used instead (even though deprecated) since K8s isn't aware of the --gpus Docker flag yet (not sure if that is still the case).
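
For context, the two invocation styles look like this (a minimal sketch, not from this thread; the CUDA image tag is just an example from that era):

# toolkit-only path: Docker 19.03+ exposes GPUs through the --gpus flag
docker run --rm --gpus all nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi
# nvidia-docker2 path: the "nvidia" runtime registered in /etc/docker/daemon.json,
# which is the mechanism Kubernetes relies on
docker run --rm --runtime=nvidia nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi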

So it looks like the instructions for Docker and K8s are currently different. I setup per the instructions in this repo for K8s but right now I can't run anything in Docker so I doubt it will work in K8s. When I try to run with the nvidia runtime I get segfaults immediately. Still trying to track that down.

ardenpm avatar May 19 '20 04:05 ardenpm

I agree, the docs are confusing and should be synchronized better.

Please see my comment here for an explanation on how nvidia-docker2 and nvidia-container-toolkit are related: https://github.com/NVIDIA/k8s-device-plugin/issues/168#issuecomment-625981223
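
If it helps, one way to see how the packages stack on an apt-based system (a rough sketch; as I understand the packaging from that era, nvidia-docker2 sits on top of nvidia-container-runtime and the toolkit):

# inspect the dependency chain between the packages (apt-based systems)
apt-cache depends nvidia-docker2
apt-cache depends nvidia-container-runtime
# list the files (including the runtime binary) a package actually ships
dpkg -L nvidia-container-runtime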

Regarding the segfault, I'm curious if it could be related to: https://github.com/NVIDIA/nvidia-docker/issues/1280#issuecomment-630754999

klueska avatar May 19 '20 11:05 klueska

Indeed, that comment helped make it clear. It was also reassuring to know that, behind the scenes, it's basically the same, since the deprecation statements on nvidia-docker2 are a bit disconcerting.

Now, on the segfault: this was/is really strange. I think mine was actually different from the one in that issue. nvidia-container-cli would also segfault immediately, even on just the info command, so I don't think it was specific to Docker.

All of my testing there was on CentOS 7 latest and I wasn’t able to resolve the problem. Since I needed to do some testing I switched to Ubuntu 18.04 and was not able to replicate the issue there at all.

I still have both instances in a stopped state on AWS from my testing, so I can probably get more details on the actual segfault stack trace, but I'm not sure if others are encountering this. The actual error was related to munmap_chunk: invalid pointer.
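
If anyone wants to capture the trace themselves, here is a minimal way to do it (a sketch, assuming gdb is available on the box):

# run the failing command under gdb; on the segfault, print a backtrace and exit
gdb -q -batch -ex run -ex bt --args nvidia-container-cli info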

ardenpm avatar May 19 '20 12:05 ardenpm

I had the same issue setting up a k8s cluster with GPUs. I went through the comments here and other related issues, and put together the steps that made it work; probably useful to people looking for a solution:

Kubernetes NVIDIA GPU device plugin

  • follow the official NVIDIA GPU device plugin instructions up to the step that configures the runtime

  • as explained in this comment, k8s still needs nvidia-container-runtime; install it:

    # install the old nvidia-container-runtime for k8s
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-runtime
    
  • create /etc/docker/daemon.json with the following content, as required by k8s

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    
  • restart docker and test:

    sudo systemctl restart docker
    # test that docker can run with GPU without the --gpus flag
    docker run nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi
    
  • finally, install the NVIDIA device plugin on your cluster (a quick test-pod check follows after these steps):

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
    
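To verify the whole chain end to end (a sketch on top of the steps above, not part of the original write-up; the pod name and image are arbitrary), schedule a pod that requests a GPU and check its logs:

# one-shot pod that requests a single GPU and runs nvidia-smi
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-runtime-ubuntu18.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod completes, the logs should show the nvidia-smi table
kubectl logs gpu-test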

kengz avatar Jun 07 '20 07:06 kengz

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 29 '24 04:02 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Mar 31 '24 04:03 github-actions[bot]