
docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/761bd05e8ceb95e1459db860b160e9dda095254a969ebd9a0b777524f73f9263/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

wjimenez5271 opened this issue 5 years ago • 4 comments

When following the latest instructions on https://github.com/NVIDIA/nvidia-docker/ to set up NVIDIA support for Docker, it says that nvidia-docker2 has been deprecated and that one should install the NVIDIA Container Toolkit instead. I followed the instructions for Ubuntu 18.04 with Docker 19.03; however, this does not seem to install the nvidia-container-runtime binary mentioned in this project's README. As a result, Docker cannot start any container after updating the default runtime in /etc/docker/daemon.json per the README. Is this device plugin not compatible with the latest iteration of the NVIDIA container stack? Here is the error message:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/761bd05e8ceb95e1459db860b160e9dda095254a969ebd9a0b777524f73f9263/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

and just to confirm that the binary is missing:

ls /usr/bin/nvidia-container-runtime
ls: cannot access '/usr/bin/nvidia-container-runtime': No such file or directory

I also tried nvidia-container-cli, since that is what the current package installs. Is it possible this repo needs to be updated to reflect nvidia-docker2's deprecation?
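
For anyone comparing setups, here is a quick way to check which of the NVIDIA container binaries and packages actually landed on the host (a minimal sketch, assuming an apt-based distro like Ubuntu 18.04; it only inspects, it doesn't change anything):

# which of the NVIDIA runtime binaries are on $PATH (missing ones print nothing)
command -v nvidia-container-runtime nvidia-container-cli nvidia-container-toolkit
# which NVIDIA container packages are installed
dpkg -l | grep -E 'nvidia-(docker|container)|libnvidia-container'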

wjimenez5271 avatar May 05 '20 01:05 wjimenez5271

The docs in this repo specifically state that nvidia-container-toolkit should not be used and that nvidia-docker2 should be used instead (even though deprecated) since K8s isn't aware of the --gpus Docker flag yet (not sure if that is still the case).
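
For context, the two invocation styles look like this (a minimal sketch, not from this thread; the CUDA image tag is just an example from that era):

# toolkit-only path: Docker 19.03+ exposes GPUs through the --gpus flag
docker run --rm --gpus all nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi
# nvidia-docker2 path: the "nvidia" runtime registered in /etc/docker/daemon.json,
# which is the mechanism Kubernetes relies on
docker run --rm --runtime=nvidia nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi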

So it looks like the instructions for Docker and K8s are currently different. I setup per the instructions in this repo for K8s but right now I can't run anything in Docker so I doubt it will work in K8s. When I try to run with the nvidia runtime I get segfaults immediately. Still trying to track that down.

ardenpm avatar May 19 '20 04:05 ardenpm

I agree, the docs are confusing and should be synchronized better.

Please see my comment here for an explanation on how nvidia-docker2 and nvidia-container-toolkit are related: https://github.com/NVIDIA/k8s-device-plugin/issues/168#issuecomment-625981223
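
If it helps, one way to see how the packages stack on an apt-based system (a rough sketch; as I understand the packaging from that era, nvidia-docker2 sits on top of nvidia-container-runtime and the toolkit):

# inspect the dependency chain between the packages (apt-based systems)
apt-cache depends nvidia-docker2
apt-cache depends nvidia-container-runtime
# list the files (including the runtime binary) a package actually ships
dpkg -L nvidia-container-runtime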

Regarding the segfault, I'm curious if it could be related to: https://github.com/NVIDIA/nvidia-docker/issues/1280#issuecomment-630754999

klueska avatar May 19 '20 11:05 klueska

Indeed, that comment helped make it clear. It was also reassuring to know that, behind the scenes, it's basically the same, since the deprecation statements on nvidia-docker2 are a bit disconcerting.

Now, on the segfault: this was/is really strange. I think mine was actually different from the one in that issue. nvidia-container-cli would also segfault immediately, even on just the info command, so I don't think it was specific to Docker.

All of my testing there was on CentOS 7 latest and I wasn’t able to resolve the problem. Since I needed to do some testing I switched to Ubuntu 18.04 and was not able to replicate the issue there at all.

I still have both instances in a stopped state on AWS from my testing, so I can probably get more details on the actual segfault stack trace, but I'm not sure if others are encountering this. The actual error was related to munmap_chunk: invalid pointer.
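
If anyone wants to capture the trace themselves, here is a minimal way to do it (a sketch, assuming gdb is available on the box):

# run the failing command under gdb; on the segfault, print a backtrace and exit
gdb -q -batch -ex run -ex bt --args nvidia-container-cli info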

ardenpm avatar May 19 '20 12:05 ardenpm

I had the same issue setting up a k8s cluster with GPUs. I went through the comments here and other related issues, and put together the steps that made it work; probably useful to people looking for a solution:

Kubernetes NVIDIA GPU device plugin

  • follow the official NVIDIA GPU device plugin instructions up to the step that configures the runtime

  • as explained in this comment, k8s still needs nvidia-container-runtime; install it:

    # install the old nvidia-container-runtime for k8s
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
      sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-runtime
    
  • create /etc/docker/daemon.json with the following content, as required by k8s

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    
  • restart docker and test:

    sudo systemctl restart docker
    # test that docker can run with GPU without the --gpus flag
    docker run nvidia/cuda:10.2-runtime-ubuntu18.04 nvidia-smi
    
  • finally, install the NVIDIA device plugin on your cluster (a quick test-pod check follows after these steps):

    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
    
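To verify the whole chain end to end (a sketch on top of the steps above, not part of the original write-up; the pod name and image are arbitrary), schedule a pod that requests a GPU and check its logs:

# one-shot pod that requests a single GPU and runs nvidia-smi
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-runtime-ubuntu18.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod completes, the logs should show the nvidia-smi table
kubectl logs gpu-test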

kengz avatar Jun 07 '20 07:06 kengz

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 29 '24 04:02 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Mar 31 '24 04:03 github-actions[bot]