Nodes with the containerd runtime in the cluster can cause nodes running the Docker runtime to break
Issue or feature description
My K8s cluster has two nodes with NVIDIA GPUs:
- node1's container runtime is docker
- node2's container runtime is containerd
$ k get node -o wide
NAME     STATUS   ROLES           AGE    VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master   Ready    control-plane   261d   v1.24.10   100.64.4.51    <none>        Ubuntu 20.04.5 LTS   5.4.0-166-generic   docker://20.10.20
node1    Ready    compute         261d   v1.24.10   100.64.4.181   <none>        Ubuntu 20.04.5 LTS   5.4.0-144-generic   docker://20.10.20
node2    Ready    compute         259d   v1.24.10   100.64.4.62    <none>        Ubuntu 20.04.5 LTS   5.4.0-144-generic   containerd://1.6.4
Docker on node1 crashes after deploying GPU Operator v23.9.0.
The reason: GPU Operator sets the runtime to containerd if at least one node in the cluster is configured with containerd (reference). GPU Operator therefore sets RUNTIME to containerd in the nvidia-container-toolkit-daemonset DaemonSet, whose pods run on both node1 and node2.
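This can be checked on an affected cluster with something like the following (a sketch only; the gpu-operator namespace and the exact DaemonSet name depend on how the operator was installed):
$ # Inspect the RUNTIME env var that GPU Operator injected into the toolkit DaemonSet
$ kubectl -n gpu-operator get daemonset nvidia-container-toolkit-daemonset -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="RUNTIME")].value}'
On a cluster hit by this issue, the value is expected to be containerd even though node1 runs Docker.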
Expectation
GPU Operator should support clusters in which some nodes use containerd and others use Docker as the container runtime at the same time.
@quanguachong we do not support this configuration currently. You can make this work by installing the container-toolkit packages manually on the node and disabling the toolkit container in the gpu-operator. This scenario is documented here:
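A minimal sketch of that workaround, assuming an Ubuntu node with NVIDIA's package repository already configured and an operator installed via the Helm chart (chart reference and release name below are illustrative):
$ # On the Docker node: install the NVIDIA Container Toolkit packages manually
$ sudo apt-get install -y nvidia-container-toolkit
$ # When installing/upgrading GPU Operator: disable the operator-managed toolkit container
$ helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --set toolkit.enabled=false
With toolkit.enabled=false the operator does not deploy nvidia-container-toolkit-daemonset, so the Docker node keeps the runtime configuration provided by the locally installed packages.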