gpu-operator
chroot: failed to run command 'nvidia-smi': No such file or directory
OS environment information
# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
# uname -a
Linux a100 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
GPU environment information
# dpkg -l | grep nvidia
ii libnvidia-cfg1-560:amd64 560.35.03-0ubuntu1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-560 560.35.03-0ubuntu1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-515:amd64 515.65.01-0ubuntu1 amd64 NVIDIA libcompute package
rc libnvidia-compute-525:amd64 525.147.05-0ubuntu2.20.04.1 amd64 NVIDIA libcompute package (transitional package)
rc libnvidia-compute-535:amd64 535.171.04-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii libnvidia-compute-560:amd64 560.35.03-0ubuntu1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.16.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.16.1-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-560:amd64 560.35.03-0ubuntu1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-560:amd64 560.35.03-0ubuntu1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-560:amd64 560.35.03-0ubuntu1 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-560:amd64 560.35.03-0ubuntu1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-560:amd64 560.35.03-0ubuntu1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
rc nvidia-compute-utils-535 535.171.04-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
ii nvidia-compute-utils-560 560.35.03-0ubuntu1 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 3.14.0-1 all NVIDIA Container Toolkit meta-package
ii nvidia-container-toolkit 1.16.1-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.16.1-1 amd64 NVIDIA Container Toolkit Base
rc nvidia-dkms-535 535.171.04-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
ii nvidia-dkms-560 560.35.03-0ubuntu1 amd64 NVIDIA DKMS package
ii nvidia-docker2 2.14.0-1 all NVIDIA Container Toolkit meta-package
ii nvidia-driver-560 560.35.03-0ubuntu1 amd64 NVIDIA driver metapackage
ii nvidia-driver-local-repo-ubuntu2004-560.35.03 1.0-1 amd64 nvidia-driver-local repository configuration files
ii nvidia-firmware-535-535.171.04 535.171.04-0ubuntu0.20.04.1 amd64 Firmware files used by the kernel module
ii nvidia-firmware-560-560.35.03 560.35.03-0ubuntu1 amd64 Firmware files used by the kernel module
rc nvidia-kernel-common-535 535.171.04-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-common-560 560.35.03-0ubuntu1 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-560 560.35.03-0ubuntu1 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.16~0.20.04.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 515.65.01-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-560 560.35.03-0ubuntu1 amd64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-560 560.35.03-0ubuntu1 amd64 NVIDIA binary Xorg driver
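For reference, the host already carries the full 560.35.03 driver stack plus NVIDIA Container Toolkit 1.16.1. A quick sanity check that the host toolkit itself is functional (hypothetical commands, not part of the original capture) would be:
# nvidia-ctk --version
# nvidia-container-cli info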
# nvidia-smi
Thu Oct 24 09:19:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:41:00.0 Off | 0 |
| N/A 32C P0 36W / 250W | 3028MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:C1:00.0 Off | 0 |
| N/A 34C P0 36W / 250W | 17MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2221 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 5347 C python 998MiB |
| 0 N/A N/A 6150 C /opt/conda/bin/python 998MiB |
| 0 N/A N/A 6151 C /opt/conda/bin/python 998MiB |
| 1 N/A N/A 2221 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
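Both A100s are visible and healthy under the host driver, so the operator is expected to skip driver deployment. The values.yaml passed to helm below is not included in this report; on a node with a preinstalled driver it would presumably carry the chart's standard switch, equivalent to:
# helm install gpu-operator -n gpu-operator . --set driver.enabled=false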
Installation steps of the GPU Operator
# kubectl create ns gpu-operator
namespace/gpu-operator created
# helm install --kubeconfig=/var/lib/secctr/k3s/server/cred/admin.kubeconfig gpu-operator -n gpu-operator . --values values.yaml
NAME: gpu-operator
LAST DEPLOYED: Thu Oct 24 09:19:31 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
# helm list -n gpu-operator
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator gpu-operator 1 2024-10-24 09:19:31.372209469 +0800 CST deployed gpu-operator-v24.6.2 v24.6.2
# kubectl get all -n gpu-operator
NAME READY STATUS RESTARTS AGE
pod/gpu-feature-discovery-z4v8q 0/1 Init:0/1 0 89s
pod/gpu-operator-7d66589d9b-rkqrm 1/1 Running 0 93s
pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 1/1 Running 0 93s
pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj 1/1 Running 0 93s
pod/gpu-operator-node-feature-discovery-worker-6bwfw 1/1 Running 0 93s
pod/nvidia-container-toolkit-daemonset-lkqz9 0/1 Init:0/1 0 90s
pod/nvidia-dcgm-exporter-lxjqv 0/1 Init:0/1 0 89s
pod/nvidia-device-plugin-daemonset-5tncc 0/1 Init:0/1 0 89s
pod/nvidia-operator-validator-jctzl 0/1 Init:0/4 0 90s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/gpu-operator ClusterIP 10.43.21.239 <none> 8080/TCP 91s
service/nvidia-dcgm-exporter ClusterIP 10.43.126.172 <none> 9400/TCP 90s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/gpu-feature-discovery 1 1 0 1 0 nvidia.com/gpu.deploy.gpu-feature-discovery=true 89s
daemonset.apps/gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 93s
daemonset.apps/nvidia-container-toolkit-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.container-toolkit=true 90s
daemonset.apps/nvidia-dcgm-exporter 1 1 0 1 0 nvidia.com/gpu.deploy.dcgm-exporter=true 89s
daemonset.apps/nvidia-device-plugin-daemonset 1 1 0 1 0 nvidia.com/gpu.deploy.device-plugin=true 90s
daemonset.apps/nvidia-device-plugin-mps-control-daemon 0 0 0 0 0 nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true 89s
daemonset.apps/nvidia-driver-daemonset 0 0 0 0 0 nvidia.com/gpu.deploy.driver=true 90s
daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 89s
daemonset.apps/nvidia-operator-validator 1 1 0 1 0 nvidia.com/gpu.deploy.operator-validator=true 90s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/gpu-operator 1/1 1 1 93s
deployment.apps/gpu-operator-node-feature-discovery-gc 1/1 1 1 93s
deployment.apps/gpu-operator-node-feature-discovery-master 1/1 1 1 93s
NAME DESIRED CURRENT READY AGE
replicaset.apps/gpu-operator-7d66589d9b 1 1 1 93s
replicaset.apps/gpu-operator-node-feature-discovery-gc-7478549676 1 1 1 93s
replicaset.apps/gpu-operator-node-feature-discovery-master-67769784f5 1 1 1 93s
# crictl image ls | grep -e nvidia -e vgpu -e nfd
nvcr.io/nvidia/cloud-native/dcgm 3.3.7-1-ubuntu22.04 292733d61a20b 1.99GB
nvcr.io/nvidia/cloud-native/gpu-operator-validator latest 8371f914ffba3 324MB
nvcr.io/nvidia/cloud-native/gpu-operator-validator v24.6.2 8371f914ffba3 324MB
nvcr.io/nvidia/cloud-native/k8s-cc-manager v0.1.1 c5006389d56b3 647MB
nvcr.io/nvidia/cloud-native/k8s-driver-manager v0.6.10 dd9cff3ea5509 590MB
nvcr.io/nvidia/cloud-native/k8s-kata-manager v0.2.1 082572359e199 449MB
nvcr.io/nvidia/cloud-native/vgpu-device-manager v0.2.7 9f7a380f3f3e0 419MB
nvcr.io/nvidia/cuda 12.6.1-base-ubi8 103c9a2598a96 389MB
nvcr.io/nvidia/driver 550.90.07 62f8c7903995a 1.17GB
nvcr.io/nvidia/gpu-operator v24.6.2 d57aeeb1c5a37 623MB
nvcr.io/nvidia/k8s-device-plugin v0.16.2-ubi8 44edb05883259 505MB
nvcr.io/nvidia/k8s/container-toolkit v1.16.2-ubuntu20.04 bdcc66b183991 350MB
nvcr.io/nvidia/k8s/dcgm-exporter 3.3.7-3.5.0-ubuntu22.04 ee8c6dfbf28aa 350MB
nvcr.io/nvidia/kubevirt-gpu-device-plugin v1.2.9 3b4407d30d0d6 415MB
registry.k8s.io/nfd/node-feature-discovery v0.16.3 bc292d823f05c 226MB
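Note that a 550.90.07 driver image is cached locally even though the host runs 560.35.03. Whether the operator actually intends to deploy a containerized driver can be read back from the ClusterPolicy (sketch; cluster-policy is the chart's default instance name):
# kubectl get clusterpolicy cluster-policy -o jsonpath='{.spec.driver.enabled}'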
Error information of the driver-validation container
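The log below was presumably obtained from the driver-validation init container of the stuck validator pod, i.e. roughly:
# kubectl logs -n gpu-operator pod/nvidia-operator-validator-jctzl -c driver-validation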
2024-10-24T09:25:48.176690024+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:48.178993123+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:48.179246578+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:53.179544295+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:53.181845219+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:53.182124683+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:58.182391151+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:58.184547885+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:58.184862084+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:03.185091032+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:03.187436039+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:03.187669887+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:08.187985477+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:08.190198327+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:08.190459947+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:13.190717358+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:13.192869694+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:13.193131254+08:00 command failed, retrying after 5 seconds
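The failing command can be reproduced directly on the node. The validator chroots into /run/nvidia/driver, and since that tree (listed at the end of this report) contains no driver binaries, the same error should appear when run as root:
# chroot /run/nvidia/driver nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory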
K8s event info
# kubectl get event -n gpu-operator
LAST SEEN TYPE REASON OBJECT MESSAGE
12m Normal LeaderElection lease/53822513.nvidia.com gpu-operator-7d66589d9b-rkqrm_4bb2ab72-3969-4020-87e4-1704ded2e72d became leader
12m Normal Scheduled pod/gpu-feature-discovery-z4v8q Successfully assigned gpu-operator/gpu-feature-discovery-z4v8q to de9e0472.secctr.com
12m Normal Pulled pod/gpu-feature-discovery-z4v8q Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/gpu-feature-discovery-z4v8q Created container toolkit-validation
12m Normal Started pod/gpu-feature-discovery-z4v8q Started container toolkit-validation
12m Normal SuccessfulCreate daemonset/gpu-feature-discovery Created pod: gpu-feature-discovery-z4v8q
12m Normal Scheduled pod/gpu-operator-7d66589d9b-rkqrm Successfully assigned gpu-operator/gpu-operator-7d66589d9b-rkqrm to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-7d66589d9b-rkqrm Container image "nvcr.io/nvidia/gpu-operator:v24.6.2" already present on machine
12m Normal Created pod/gpu-operator-7d66589d9b-rkqrm Created container gpu-operator
12m Normal Started pod/gpu-operator-7d66589d9b-rkqrm Started container gpu-operator
12m Normal SuccessfulCreate replicaset/gpu-operator-7d66589d9b Created pod: gpu-operator-7d66589d9b-rkqrm
12m Normal Scheduled pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m Normal Created pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Created container gc
12m Normal Started pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Started container gc
12m Normal SuccessfulCreate replicaset/gpu-operator-node-feature-discovery-gc-7478549676 Created pod: gpu-operator-node-feature-discovery-gc-7478549676-zzr66
12m Normal ScalingReplicaSet deployment/gpu-operator-node-feature-discovery-gc Scaled up replica set gpu-operator-node-feature-discovery-gc-7478549676 to 1
12m Normal Scheduled pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m Normal Created pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Created container master
12m Normal Started pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Started container master
12m Normal SuccessfulCreate replicaset/gpu-operator-node-feature-discovery-master-67769784f5 Created pod: gpu-operator-node-feature-discovery-master-67769784f5-7pqvj
12m Normal ScalingReplicaSet deployment/gpu-operator-node-feature-discovery-master Scaled up replica set gpu-operator-node-feature-discovery-master-67769784f5 to 1
12m Normal Scheduled pod/gpu-operator-node-feature-discovery-worker-6bwfw Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-6bwfw to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-node-feature-discovery-worker-6bwfw Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m Normal Created pod/gpu-operator-node-feature-discovery-worker-6bwfw Created container worker
12m Normal Started pod/gpu-operator-node-feature-discovery-worker-6bwfw Started container worker
12m Normal SuccessfulCreate daemonset/gpu-operator-node-feature-discovery-worker Created pod: gpu-operator-node-feature-discovery-worker-6bwfw
12m Normal ScalingReplicaSet deployment/gpu-operator Scaled up replica set gpu-operator-7d66589d9b to 1
12m Normal Scheduled pod/nvidia-container-toolkit-daemonset-lkqz9 Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-lkqz9 to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-container-toolkit-daemonset-lkqz9 Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-container-toolkit-daemonset-lkqz9 Created container driver-validation
12m Normal Started pod/nvidia-container-toolkit-daemonset-lkqz9 Started container driver-validation
12m Normal SuccessfulCreate daemonset/nvidia-container-toolkit-daemonset Created pod: nvidia-container-toolkit-daemonset-lkqz9
12m Normal Scheduled pod/nvidia-dcgm-exporter-lxjqv Successfully assigned gpu-operator/nvidia-dcgm-exporter-lxjqv to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-dcgm-exporter-lxjqv Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-dcgm-exporter-lxjqv Created container toolkit-validation
12m Normal Started pod/nvidia-dcgm-exporter-lxjqv Started container toolkit-validation
12m Normal SuccessfulCreate daemonset/nvidia-dcgm-exporter Created pod: nvidia-dcgm-exporter-lxjqv
12m Normal Scheduled pod/nvidia-device-plugin-daemonset-5tncc Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-5tncc to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-device-plugin-daemonset-5tncc Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-device-plugin-daemonset-5tncc Created container toolkit-validation
12m Normal Started pod/nvidia-device-plugin-daemonset-5tncc Started container toolkit-validation
12m Normal SuccessfulCreate daemonset/nvidia-device-plugin-daemonset Created pod: nvidia-device-plugin-daemonset-5tncc
12m Normal Scheduled pod/nvidia-driver-daemonset-vpvv6 Successfully assigned gpu-operator/nvidia-driver-daemonset-vpvv6 to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-driver-daemonset-vpvv6 Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10" already present on machine
12m Normal Created pod/nvidia-driver-daemonset-vpvv6 Created container k8s-driver-manager
12m Normal Started pod/nvidia-driver-daemonset-vpvv6 Started container k8s-driver-manager
12m Normal Killing pod/nvidia-driver-daemonset-vpvv6 Stopping container k8s-driver-manager
12m Normal SuccessfulCreate daemonset/nvidia-driver-daemonset Created pod: nvidia-driver-daemonset-vpvv6
12m Normal SuccessfulDelete daemonset/nvidia-driver-daemonset Deleted pod: nvidia-driver-daemonset-vpvv6
12m Normal Scheduled pod/nvidia-operator-validator-jctzl Successfully assigned gpu-operator/nvidia-operator-validator-jctzl to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-operator-validator-jctzl Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-operator-validator-jctzl Created container driver-validation
12m Normal Started pod/nvidia-operator-validator-jctzl Started container driver-validation
12m Normal SuccessfulCreate daemonset/nvidia-operator-validator Created pod: nvidia-operator-validator-jctzl
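Note that pod/nvidia-driver-daemonset-vpvv6 was created and deleted within the same minute: the operator apparently detected the preinstalled host driver and scaled nvidia-driver-daemonset down to 0, which matches DESIRED 0 above. If so, the node's deploy label should read pre-installed rather than true (hypothetical check; node name taken from the events):
# kubectl describe node de9e0472.secctr.com | grep 'nvidia.com/gpu.deploy.driver'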
Information of /run/nvidia/
# ls /run/nvidia/
driver mps toolkit validations
# ls /run/nvidia/driver/
lib
# ls /run/nvidia/driver/lib/
firmware
# ls /run/nvidia/driver/lib/firmware/
# ls /run/nvidia/mps/
# ls /run/nvidia/toolkit/
# ls /run/nvidia/validations/
#
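So /run/nvidia/driver holds nothing but an empty lib/firmware tree, apparently residue left behind by the deleted driver daemonset, yet the driver-validation container still treats the directory as a containerized driver root and chroots into it. One possible (untested) workaround, assuming the host-installed 560.35.03 driver is meant to be used directly, is to clear the residue and let the validator pod restart:
# rm -rf /run/nvidia/driver/lib
# kubectl delete pod -n gpu-operator -l app=nvidia-operator-validator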