gpu-operator
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for “nvidia” is configured
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): 20.04
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): EKS
- GPU Operator Version: v23.3.2
2. Issue or feature description
It works on first setup and runs fine when the node starts, but after a period of time the pods fail as shown below:
$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-gj7v5                                       0/1     Init:0/1                0                3m31s
gpu-feature-discovery-zwpvt                                       0/1     Init:0/1                0                4m5s
gpu-operator-5f5589bb7c-4mpgw                                     1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-ma8pjr8   1/1     Running                 0                13h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo6zbqv   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wo7hz58   1/1     Running                 0                48d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wopdzng   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worbzxr   1/1     Running                 0                24d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-worzplr   1/1     Running                 0                23h
gpu-operator-gpu-operator-nvidia-node-feature-discovery-wovhh5r   1/1     Running                 0                46d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxg6rj   1/1     Running                 0                28d
gpu-operator-gpu-operator-nvidia-node-feature-discovery-woxjpx6   1/1     Running                 0                23h
nvidia-container-toolkit-daemonset-2trdc                          0/1     Init:0/1                0                4m5s
nvidia-container-toolkit-daemonset-7z44h                          0/1     Init:0/1                0                3m31s
nvidia-dcgm-exporter-hlq27                                        0/1     Init:0/1                0                4m5s
nvidia-dcgm-exporter-z9v66                                        0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-hsqnw                              0/1     Init:0/1                0                3m31s
nvidia-device-plugin-daemonset-xn5m8                              0/1     Init:0/1                0                4m5s
nvidia-driver-daemonset-ht7bc                                     0/1     Init:CrashLoopBackOff   92 (3m31s ago)   7h32m
nvidia-driver-daemonset-ngbdl                                     0/1     Init:CrashLoopBackOff   91 (4m5s ago)    7h29m
nvidia-operator-validator-8kzrf                                   0/1     Init:0/4                0                3m31s
nvidia-operator-validator-ms7k9                                   0/1     Init:0/4                0                4m5s
When we check the events in the namespace, we see errors like this:
27m Warning FailedCreatePodSandBox pod/nvidia-operator-validator-xhqvx Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
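To narrow this down, it may help to check on an affected node whether the container runtime config still declares a runtime handler named "nvidia". The following is only a minimal sketch under assumptions not confirmed above: that the node uses containerd, and that its config lives at /etc/containerd/config.toml. `has_nvidia_runtime` is a hypothetical helper name, not part of any NVIDIA or containerd tooling.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the given containerd config file declares
# a runtime handler table named "nvidia", non-zero otherwise.
# Assumption: the config path defaults to /etc/containerd/config.toml.
has_nvidia_runtime() {
    config_file="${1:-/etc/containerd/config.toml}"
    # Match a TOML table header such as:
    #   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    # -q: no output, -s: stay silent if the file is missing or unreadable.
    grep -Eqs '^[[:space:]]*\[plugins\..*\.runtimes\.nvidia\]' "$config_file"
}

# Usage on the node (path is an assumption):
#   has_nvidia_runtime /etc/containerd/config.toml && echo present || echo missing
```

If the handler is missing on a node that previously worked, that would suggest the config was overwritten or never rewritten after a restart, which is worth comparing against the nvidia-container-toolkit-daemonset state on that node.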
3. Steps to reproduce the issue
Helm values file:
gpuOperator:
  operator:
    # upgrade CRD on chart upgrade, requires --disable-openapi-validation flag
    # to be passed during helm upgrade.
    upgradeCRD: false
    initContainer:
      image: cuda
      repository: nvcr.io/nvidia
      version: 11.7.1-base-ubi8
      imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 5m
        memory: 80Mi
  driver:
    enabled: true
    repository: nvcr.io/nvidia
    image: driver
    version: "515.105.01"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 5m
        memory: 10Mi
  toolkit:
    enabled: true
    resources:
      requests:
        cpu: 5m
        memory: 10Mi
  devicePlugin:
    enabled: true
    resources:
      requests:
        cpu: 5m
        memory: 10Mi
  dcgmExporter:
    resources:
      requests:
        cpu: 5m
        memory: 230Mi
    serviceMonitor:
      enabled: true
      interval: 30s
      honorLabels: false
      additionalLabels:
        release: prometheus
  gfd:
    enabled: true
    resources:
      requests:
        cpu: 5m
        memory: 25Mi
  vfioManager:
    enabled: true
    repository: nvcr.io/nvidia
    image: cuda
    version: 11.7.1-base-ubi8
    imagePullPolicy: IfNotPresent
  node-feature-discovery:
    enableNodeFeatureApi: true
    master:
      resources:
        requests:
          cpu: 5m
          memory: 80Mi
    worker:
      resources:
        requests:
          cpu: 5m
          memory: 20Mi
Any suggestion to resolve this issue would be very much appreciated!
Thanks!