
Getting Error: "stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown" while deploying gpu-operator v22.9.0 on SLES 15 SP4

Open ATP-55 opened this issue 3 years ago • 9 comments

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node? No. SUSE Linux Enterprise Server 15 SP4
  • [ ] Are you running Kubernetes v1.13+? Yes. K8s v1.21.10
  • [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? containerd github.com/containerd/containerd v1.6.1
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes? Yes
  • [ ] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

Trying to deploy gpu-operator v22.9.0 on SLES 15 SP4. The worker node already has the NVIDIA driver installed, but the following pods are failing:

    nvidia-dcgm-exporter-2xlz8             0/1   Init:CrashLoopBackOff   7   12m
    nvidia-device-plugin-daemonset-nmb7r   0/1   Init:CrashLoopBackOff   7   12m
    nvidia-operator-validator-v8xn9        0/1   Init:CrashLoopBackOff   7   12m
    gpu-feature-discovery-9t6sp            0/1   Init:CrashLoopBackOff   7   12m

2. Steps to reproduce the issue

Install the GPU Operator Helm chart 22.9.0 on SLES 15 SP4, with the NVIDIA driver preinstalled on the worker node.
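For reference, a minimal install sketch along these lines (repo setup, release name, namespace, and the driver.enabled=false override are illustrative assumptions for a node with a preinstalled driver, not values taken from this report):

    # Sketch only: assumes the public NVIDIA Helm repo; names are illustrative.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
    helm install gpu-operator nvidia/gpu-operator \
      --version v22.9.0 \
      --namespace gpu-operator --create-namespace \
      --set driver.enabled=false    # driver is already installed on the SLES worker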

3. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods --all-namespaces

    SGH123VZMJ:/home/edison/atp/gpu-operator-22.9.0/deployments/gpu-operator # kubectl get po
    NAME                                                 READY   STATUS                   RESTARTS   AGE
    gpu-feature-discovery-9t6sp                          0/1     Init:RunContainerError   3          61s
    gpu-node-feature-discovery-master-64864bd756-fthcl   1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-2l2l6              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-4wnzw              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-8bhmc              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-8vp78              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-k5jjd              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-lfbgd              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-sj9m9              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-twbms              1/1     Running                  0          84s
    gpu-node-feature-discovery-worker-zfkqt              1/1     Running                  0          84s
    gpu-operator-7bdd8bf555-pvxz5                        1/1     Running                  0          84s
    nvidia-container-toolkit-daemonset-2hdpv             1/1     Running                  0          63s
    nvidia-dcgm-exporter-2xlz8                           0/1     Init:RunContainerError   3          62s
    nvidia-device-plugin-daemonset-nmb7r                 0/1     Init:RunContainerError   3          63s
    nvidia-operator-validator-v8xn9                      0/1     Init:RunContainerError   3          63s

  • [ ] kubernetes daemonset status: kubectl get ds --all-namespaces

  • [ ] If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

POD Events: gpu-feature-discovery-9t6sp

    Events:
      Type     Reason                  Age                From               Message
      Normal   Scheduled               98s                default-scheduler  Successfully assigned default/gpu-feature-discovery-9t6sp to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
      Warning  FailedCreatePodSandBox  98s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
      Normal   Pulled                  44s (x4 over 84s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
      Normal   Created                 44s (x4 over 84s)  kubelet            Created container toolkit-validation
      Warning  Failed                  44s (x4 over 84s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
      Warning  BackOff                 4s (x8 over 83s)   kubelet            Back-off restarting failed container

POD: nvidia-operator-validator-v8xn9

    Events:
      Type     Reason                  Age                  From               Message
      Normal   Scheduled               3m8s                 default-scheduler  Successfully assigned default/nvidia-operator-validator-v8xn9 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
      Warning  FailedCreatePodSandBox  3m7s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
      Normal   Pulled                  85s (x5 over 2m54s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
      Normal   Created                 85s (x5 over 2m54s)  kubelet            Created container driver-validation
      Warning  Failed                  85s (x5 over 2m53s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
      Warning  BackOff                 73s (x9 over 2m52s)  kubelet            Back-off restarting failed container

POD: nvidia-device-plugin-daemonset-nmb7r

    Events:
      Type     Reason                  Age                  From               Message
      Normal   Scheduled               2m56s                default-scheduler  Successfully assigned default/nvidia-device-plugin-daemonset-nmb7r to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
      Warning  FailedCreatePodSandBox  2m55s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
      Normal   Pulled                  74s (x5 over 2m42s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
      Normal   Created                 74s (x5 over 2m42s)  kubelet            Created container toolkit-validation
      Warning  Failed                  74s (x5 over 2m42s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
      Warning  BackOff                 63s (x9 over 2m41s)  kubelet            Back-off restarting failed container

POD: nvidia-dcgm-exporter-2xlz8

    Events:
      Type     Reason                  Age                  From               Message
      Normal   Scheduled               2m47s                default-scheduler  Successfully assigned default/nvidia-dcgm-exporter-2xlz8 to jelly-oct13-2022-sample-gpu-pool2-jelly-oct13-2022-94b0-qh26p
      Warning  FailedCreatePodSandBox  2m46s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
      Normal   Pulled                  58s (x5 over 2m35s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
      Normal   Created                 58s (x5 over 2m35s)  kubelet            Created container toolkit-validation
      Warning  Failed                  58s (x5 over 2m35s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
      Warning  BackOff                 58s (x9 over 2m34s)  kubelet            Back-off restarting failed container

  • [ ] If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • [ ] Output of running a container on the GPU machine: docker run -it alpine echo foo

  • [ ] Docker configuration file: cat /etc/docker/daemon.json

  • [ ] Docker runtime configuration: docker info | grep runtime

    default_runtime_name = "nvidia"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
          BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
        runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
          BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
        runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]

  • [ ] NVIDIA shared directory: ls -la /run/nvidia

    total 4
    drwxr-xr-x  4 root root 100 Nov 17 03:23 .
    drwxr-xr-x 33 root root 880 Nov 17 03:17 ..
    drwxr-xr-x  2 root root  40 Nov 17 03:17 driver
    -rw-r--r--  1 root root   6 Nov 17 03:23 toolkit.pid
    drwxr-xr-x  2 root root  60 Nov 17 03:22 validations

  • [ ] NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

    total 12912
    drwxr-xr-x 3 root root    4096 Nov 17 03:23 .
    drwxr-xr-x 3 root root      21 Nov 17 03:23 ..
    drwxr-xr-x 3 root root      38 Nov 17 03:23 .config
    lrwxrwxrwx 1 root root      32 Nov 17 03:23 libnvidia-container-go.so.1 -> libnvidia-container-go.so.1.11.0
    -rw-r--r-- 1 root root 2959384 Nov 17 03:23 libnvidia-container-go.so.1.11.0
    lrwxrwxrwx 1 root root      29 Nov 17 03:23 libnvidia-container.so.1 -> libnvidia-container.so.1.11.0
    -rwxr-xr-x 1 root root  195856 Nov 17 03:23 libnvidia-container.so.1.11.0
    -rwxr-xr-x 1 root root     154 Nov 17 03:23 nvidia-container-cli
    -rwxr-xr-x 1 root root   47472 Nov 17 03:23 nvidia-container-cli.real
    -rwxr-xr-x 1 root root     342 Nov 17 03:23 nvidia-container-runtime
    -rwxr-xr-x 1 root root     350 Nov 17 03:23 nvidia-container-runtime-experimental
    -rwxr-xr-x 1 root root     203 Nov 17 03:23 nvidia-container-runtime-hook
    -rwxr-xr-x 1 root root 2142088 Nov 17 03:23 nvidia-container-runtime-hook.real
    -rwxr-xr-x 1 root root 3771792 Nov 17 03:23 nvidia-container-runtime.experimental
    -rwxr-xr-x 1 root root 4079040 Nov 17 03:23 nvidia-container-runtime.real
    lrwxrwxrwx 1 root root      29 Nov 17 03:23 nvidia-container-toolkit -> nvidia-container-runtime-hook

  • [ ] NVIDIA driver directory: ls -la /run/nvidia/driver

    total 0
    drwxr-xr-x 2 root root  40 Nov 17 03:17 .
    drwxr-xr-x 4 root root 100 Nov 17 03:23 ..

Note: the NVIDIA driver is installed as an RPM on the worker node (nvidia-computeG05-470.129.06-150400.54.1.x86_64).

Result of nvidia-smi:

    Thu Nov 17 03:31:09 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4   |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:08.0 Off |                    0 |
    | N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla T4            Off  | 00000000:00:09.0 Off |                    0 |
    | N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

  • [ ] kubelet logs journalctl -u kubelet > kubelet.logs

ATP-55 · Nov 17 '22

@Amrutayan Can you describe the nvidia-container-toolkit-daemonset to see what image is being used?
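For example, something along these lines (a sketch; adjust the namespace to wherever the operator pods run, which appears to be default in this thread):

    # Sketch: show the container image configured on the toolkit daemonset.
    kubectl describe daemonset nvidia-container-toolkit-daemonset | grep -i image: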

cdesiniotis · Nov 18 '22

I have used gpu-operator v22.9, so the toolkit image in use is:

    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: v1.11.0-ubuntu20.04

ATP-55 · Nov 21 '22

@Amrutayan Can you use the v1.11.0-ubi8 toolkit image instead? Please see the discussion here.
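A sketch of that override via Helm, assuming the chart exposes a toolkit.version value (release name and other values are illustrative):

    # Sketch only: switch the container-toolkit image tag to the ubi8 variant.
    helm upgrade gpu-operator nvidia/gpu-operator \
      --reuse-values \
      --set toolkit.version=v1.11.0-ubi8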

shivamerla · Nov 21 '22

I have tried that, but the error remains the same:

    Containers:
      nvidia-container-toolkit-ctr:
        Container ID:  containerd://502006a8772e498c6ba4f874fdacee353208489ff114b790cf4d82cc4334b7c9
        Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8
        Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:efb88937f73434994d1bbadc87b492a1df047aa9f8d6e9f5ec3b09536e6e7691
        Port:
        Host Port:
        Command:       bash -c
        Args:          [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; exec nvidia-toolkit /usr/local/nvidia
        State:         Running
          Started:     Thu, 24 Nov 2022 13:45:20 +0000

Pod status:

    NAME                                                 READY   STATUS                  RESTARTS   AGE
    gpu-feature-discovery-mfvcd                          0/1     Init:CrashLoopBackOff   4          3m54s
    gpu-node-feature-discovery-master-64864bd756-5skpq   1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-7w8pp              1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-c8tfr              1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-ms6q5              1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-qnkdr              1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-rngnf              1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-t6d6z              1/1     Running                 1          7h13m
    gpu-node-feature-discovery-worker-ws25n              1/1     Running                 0          20h
    gpu-node-feature-discovery-worker-zz8rp              1/1     Running                 0          20h
    gpu-operator-7bdd8bf555-kcfhp                        1/1     Running                 0          20h
    nvidia-container-toolkit-daemonset-wkssm             1/1     Running                 0          4m8s
    nvidia-dcgm-exporter-fqpv7                           0/1     Init:CrashLoopBackOff   4          3m56s
    nvidia-device-plugin-daemonset-4bvwr                 0/1     Init:CrashLoopBackOff   4          4m8s
    nvidia-operator-validator-zfm5z                      0/1     Init:CrashLoopBackOff   4          3m54s

kubectl describe po nvidia-dcgm-exporter-fqpv7

    d to get sandbox runtime: no runtime for "nvidia" is configured
    Normal   Pulled   3m2s (x4 over 3m44s)   kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
    Normal   Created  3m2s (x4 over 3m44s)   kubelet  Created container toolkit-validation
    Warning  Failed   3m1s (x4 over 3m44s)   kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
    Warning  BackOff  30s (x17 over 3m43s)   kubelet  Back-off restarting failed container

kubectl describe po nvidia-device-plugin-daemonset-4bvwr

    Warning  FailedCreatePodSandBox  4m55s                   kubelet  Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
    Warning  Failed                  4m22s (x3 over 4m39s)   kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
    Normal   Pulled                  3m56s (x4 over 4m39s)   kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0" already present on machine
    Normal   Created                 3m56s (x4 over 4m39s)   kubelet  Created container toolkit-validation
    Warning  BackOff                 92s (x16 over 4m38s)    kubelet  Back-off restarting failed container

kubectl describe po gpu-feature-discovery-mfvcd

    Normal   Created  4m48s (x4 over 5m30s)   kubelet  Created container driver-validation
    Warning  Failed   4m47s (x4 over 5m30s)   kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
    Warning  BackOff  2m10s (x17 over 5m29s)  kubelet  Back-off restarting failed container

kubectl describe po nvidia-operator-validator-zfm5z

    Normal   Created  5m23s (x4 over 6m5s)    kubelet  Created container toolkit-validation
    Warning  Failed   5m22s (x4 over 6m5s)    kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown
    Warning  BackOff  2m50s (x17 over 6m3s)   kubelet  Back-off restarting failed container

ATP-55 · Nov 24 '22

Can you please take a look and suggest?

ATP-55 · Dec 05 '22

@cdesiniotis Can you please suggest?

ATP-55 · Dec 08 '22

Can you check whether the driver root is set correctly (to / in this case) in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml?
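A quick way to check might look like this (a sketch, assuming the toolkit's generated config follows the standard nvidia-container-runtime layout with the root key under the [nvidia-container-cli] section):

    # Sketch: inspect the driver root configured by the toolkit.
    grep -A 5 'nvidia-container-cli' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
    # With a host-installed driver (as in this report), root = "/" is the expectation;
    # root = "/run/nvidia/driver" would point at the empty driver-container path shown above.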

shivamerla · Dec 08 '22

I don't know if this helps your case, but I had the same error, and increasing the pod's memory request and limits to at least 1G solved the issue.
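If you want to experiment with that, one way is sketched below (illustrative only; the operator reconciles its daemonsets and may revert manual edits, so a chart-level resources override is the more durable route if your chart version exposes one):

    # Illustrative only: bump memory requests/limits on one failing daemonset.
    # Add -n <namespace> if the operator pods are not in the default namespace.
    kubectl set resources daemonset/nvidia-operator-validator \
      --requests=memory=1Gi --limits=memory=1Gi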

cmisale · Jan 13 '23

I've encountered the "Auto-detected mode as 'legacy'" error when accidentally specifying a device that did not exist.
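As an illustration of that failure mode (the device index and image tag below are hypothetical, not taken from this issue): asking the NVIDIA runtime for a GPU index that is not present on the node can surface the same hook error during container init.

    # Hypothetical repro: on a node with only GPUs 0 and 1, index 7 does not exist,
    # so the nvidia container runtime hook fails while setting up the container.
    docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=7 \
      nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi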

figuernd · Oct 22 '24

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] · Nov 05 '25

This issue has been open for over 90 days without recent updates, and the context may now be outdated.

Given that gpu-operator v22.9.0 is now EOL, I would encourage you to try the latest version and see if you still see this issue.

If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.

cdesiniotis · Nov 14 '25