Nodes with the containerd runtime in the cluster can cause nodes running the Docker runtime to break
Issue or feature description
My K8s cluster has two nodes with NVIDIA GPUs:
- node1's container runtime is docker
- node2's container runtime is containerd
$ k get node -o wide
NAME     STATUS   ROLES           AGE    VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master   Ready    control-plane   261d   v1.24.10   100.64.4.51    <none>        Ubuntu 20.04.5 LTS   5.4.0-166-generic   docker://20.10.20
node1    Ready    compute         261d   v1.24.10   100.64.4.181   <none>        Ubuntu 20.04.5 LTS   5.4.0-144-generic   docker://20.10.20
node2    Ready    compute         259d   v1.24.10   100.64.4.62    <none>        Ubuntu 20.04.5 LTS   5.4.0-144-generic   containerd://1.6.4
Docker on node1 crashes after deploying GPU Operator v23.9.0.
The reason: GPU Operator sets the runtime to containerd if at least one node in the cluster is configured with containerd (reference). GPU Operator therefore sets RUNTIME to containerd in the nvidia-container-toolkit-daemonset DaemonSet, whose pods run on both node1 and node2.
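This can be checked on an affected cluster with something like the following (a sketch only; the gpu-operator namespace and the exact DaemonSet name depend on how the operator was installed):
$ # Inspect the RUNTIME env var that GPU Operator injected into the toolkit DaemonSet
$ kubectl -n gpu-operator get daemonset nvidia-container-toolkit-daemonset -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="RUNTIME")].value}'
On a cluster hit by this issue, the value is expected to be containerd even though node1 runs Docker.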
Expectation
GPU Operator should support clusters in which some nodes use containerd and others use Docker as the container runtime at the same time.
@quanguachong we do not support this configuration currently. You can make this work by installing the container-toolkit packages manually on the node and disabling the toolkit container in the gpu-operator. This scenario is documented here:
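A minimal sketch of that workaround, assuming an Ubuntu node with NVIDIA's package repository already configured and an operator installed via the Helm chart (chart reference and release name below are illustrative):
$ # On the Docker node: install the NVIDIA Container Toolkit packages manually
$ sudo apt-get install -y nvidia-container-toolkit
$ # When installing/upgrading GPU Operator: disable the operator-managed toolkit container
$ helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --set toolkit.enabled=false
With toolkit.enabled=false the operator does not deploy nvidia-container-toolkit-daemonset, so the Docker node keeps the runtime configuration provided by the locally installed packages.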