
chroot: failed to run command 'nvidia-smi': No such file or directory

Open vanloswang opened this issue 1 year ago • 0 comments

OS environment information

# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

# uname -a
Linux a100 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

GPU environment information

# dpkg -l | grep nvidia
ii  libnvidia-cfg1-560:amd64                      560.35.03-0ubuntu1                   amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-560                          560.35.03-0ubuntu1                   all          Shared files used by the NVIDIA libraries
rc  libnvidia-compute-515:amd64                   515.65.01-0ubuntu1                   amd64        NVIDIA libcompute package
rc  libnvidia-compute-525:amd64                   525.147.05-0ubuntu2.20.04.1          amd64        NVIDIA libcompute package (transitional package)
rc  libnvidia-compute-535:amd64                   535.171.04-0ubuntu0.20.04.1          amd64        NVIDIA libcompute package
ii  libnvidia-compute-560:amd64                   560.35.03-0ubuntu1                   amd64        NVIDIA libcompute package
ii  libnvidia-container-tools                     1.16.1-1                             amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                    1.16.1-1                             amd64        NVIDIA container runtime library
ii  libnvidia-decode-560:amd64                    560.35.03-0ubuntu1                   amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-560:amd64                    560.35.03-0ubuntu1                   amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-560:amd64                     560.35.03-0ubuntu1                   amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-560:amd64                      560.35.03-0ubuntu1                   amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-560:amd64                        560.35.03-0ubuntu1                   amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
rc  nvidia-compute-utils-535                      535.171.04-0ubuntu0.20.04.1          amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-560                      560.35.03-0ubuntu1                   amd64        NVIDIA compute utilities
ii  nvidia-container-runtime                      3.14.0-1                             all          NVIDIA Container Toolkit meta-package
ii  nvidia-container-toolkit                      1.16.1-1                             amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base                 1.16.1-1                             amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-535                               535.171.04-0ubuntu0.20.04.1          amd64        NVIDIA DKMS package
ii  nvidia-dkms-560                               560.35.03-0ubuntu1                   amd64        NVIDIA DKMS package
ii  nvidia-docker2                                2.14.0-1                             all          NVIDIA Container Toolkit meta-package
ii  nvidia-driver-560                             560.35.03-0ubuntu1                   amd64        NVIDIA driver metapackage
ii  nvidia-driver-local-repo-ubuntu2004-560.35.03 1.0-1                                amd64        nvidia-driver-local repository configuration files
ii  nvidia-firmware-535-535.171.04                535.171.04-0ubuntu0.20.04.1          amd64        Firmware files used by the kernel module
ii  nvidia-firmware-560-560.35.03                 560.35.03-0ubuntu1                   amd64        Firmware files used by the kernel module
rc  nvidia-kernel-common-535                      535.171.04-0ubuntu0.20.04.1          amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-560                      560.35.03-0ubuntu1                   amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-560                      560.35.03-0ubuntu1                   amd64        NVIDIA kernel source package
ii  nvidia-prime                                  0.8.16~0.20.04.2                     all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                               515.65.01-0ubuntu1                   amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-560                              560.35.03-0ubuntu1                   amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                       0.18build1                           all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-560                 560.35.03-0ubuntu1                   amd64        NVIDIA binary Xorg driver
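
In the listing above, "ii" marks packages that are currently installed, while "rc" marks packages that were removed but still have configuration files on disk; the 515/525/535 "rc" entries are leftovers from earlier driver versions and should be harmless. If you want to clean them up, a hedged sketch (review the package list before purging):

# dpkg -l | awk '$1 == "rc" && $2 ~ /nvidia/ {print $2}' | xargs -r dpkg --purge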

# nvidia-smi
Thu Oct 24 09:19:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:41:00.0 Off |                    0 |
| N/A   32C    P0             36W /  250W |    3028MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:C1:00.0 Off |                    0 |
| N/A   34C    P0             36W /  250W |      17MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2221      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A      5347      C   python                                        998MiB |
|    0   N/A  N/A      6150      C   /opt/conda/bin/python                         998MiB |
|    0   N/A  N/A      6151      C   /opt/conda/bin/python                         998MiB |
|    1   N/A  N/A      2221      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
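
Note that nvidia-smi succeeds directly on the host, so driver 560.35.03 is preinstalled and loaded from the host filesystem rather than from a driver container. A hedged way to double-check where the working binary and module come from:

# command -v nvidia-smi            # typically /usr/bin/nvidia-smi from nvidia-utils-560
# cat /proc/driver/nvidia/version  # version of the loaded kernel module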

installation steps of the GPU Operator

# kubectl create ns gpu-operator
namespace/gpu-operator created

# helm install --kubeconfig=/var/lib/secctr/k3s/server/cred/admin.kubeconfig gpu-operator -n gpu-operator . --values values.yaml
NAME: gpu-operator
LAST DEPLOYED: Thu Oct 24 09:19:31 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

# helm list -n gpu-operator
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator    1               2024-10-24 09:19:31.372209469 +0800 CST deployed        gpu-operator-v24.6.2    v24.6.2
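
The behavior below depends heavily on the chart values, and the contents of values.yaml are not shown here. Whether the operator deploys its own driver container is governed by the chart's driver.enabled flag; a hedged way to read back what the release was installed with (assuming the upstream chart layout):

# helm get values gpu-operator -n gpu-operator
# helm get values gpu-operator -n gpu-operator --all | grep -A 3 '^driver:'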

# kubectl get all -n gpu-operator
NAME                                                              READY   STATUS     RESTARTS   AGE
pod/gpu-feature-discovery-z4v8q                                   0/1     Init:0/1   0          89s
pod/gpu-operator-7d66589d9b-rkqrm                                 1/1     Running    0          93s
pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66       1/1     Running    0          93s
pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj   1/1     Running    0          93s
pod/gpu-operator-node-feature-discovery-worker-6bwfw              1/1     Running    0          93s
pod/nvidia-container-toolkit-daemonset-lkqz9                      0/1     Init:0/1   0          90s
pod/nvidia-dcgm-exporter-lxjqv                                    0/1     Init:0/1   0          89s
pod/nvidia-device-plugin-daemonset-5tncc                          0/1     Init:0/1   0          89s
pod/nvidia-operator-validator-jctzl                               0/1     Init:0/4   0          90s

NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/gpu-operator           ClusterIP   10.43.21.239    <none>        8080/TCP   91s
service/nvidia-dcgm-exporter   ClusterIP   10.43.126.172   <none>        9400/TCP   90s

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
daemonset.apps/gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       89s
daemonset.apps/gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                                                 93s
daemonset.apps/nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                           90s
daemonset.apps/nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               89s
daemonset.apps/nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                               90s
daemonset.apps/nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   89s
daemonset.apps/nvidia-driver-daemonset                      0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                                      90s
daemonset.apps/nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 89s
daemonset.apps/nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                          90s

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                 1/1     1            1           93s
deployment.apps/gpu-operator-node-feature-discovery-gc       1/1     1            1           93s
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           93s

NAME                                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-operator-7d66589d9b                                 1         1         1       93s
replicaset.apps/gpu-operator-node-feature-discovery-gc-7478549676       1         1         1       93s
replicaset.apps/gpu-operator-node-feature-discovery-master-67769784f5   1         1         1       93s
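
Two details stand out above: every stuck pod is waiting in a validation init container, and daemonset.apps/nvidia-driver-daemonset shows DESIRED 0, i.e. no node currently carries the nvidia.com/gpu.deploy.driver=true label, so no driver pod is running. Hedged commands to confirm both points:

# kubectl get node de9e0472.secctr.com --show-labels | tr ',' '\n' | grep nvidia.com/gpu.deploy
# kubectl -n gpu-operator logs ds/nvidia-operator-validator -c driver-validation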

# crictl image ls | grep -e nvidia -e vgpu -e nfd
nvcr.io/nvidia/cloud-native/dcgm                                3.3.7-1-ubuntu22.04            292733d61a20b       1.99GB
nvcr.io/nvidia/cloud-native/gpu-operator-validator              latest                         8371f914ffba3       324MB
nvcr.io/nvidia/cloud-native/gpu-operator-validator              v24.6.2                        8371f914ffba3       324MB
nvcr.io/nvidia/cloud-native/k8s-cc-manager                      v0.1.1                         c5006389d56b3       647MB
nvcr.io/nvidia/cloud-native/k8s-driver-manager                  v0.6.10                        dd9cff3ea5509       590MB
nvcr.io/nvidia/cloud-native/k8s-kata-manager                    v0.2.1                         082572359e199       449MB
nvcr.io/nvidia/cloud-native/vgpu-device-manager                 v0.2.7                         9f7a380f3f3e0       419MB
nvcr.io/nvidia/cuda                                             12.6.1-base-ubi8               103c9a2598a96       389MB
nvcr.io/nvidia/driver                                           550.90.07                      62f8c7903995a       1.17GB
nvcr.io/nvidia/gpu-operator                                     v24.6.2                        d57aeeb1c5a37       623MB
nvcr.io/nvidia/k8s-device-plugin                                v0.16.2-ubi8                   44edb05883259       505MB
nvcr.io/nvidia/k8s/container-toolkit                            v1.16.2-ubuntu20.04            bdcc66b183991       350MB
nvcr.io/nvidia/k8s/dcgm-exporter                                3.3.7-3.5.0-ubuntu22.04        ee8c6dfbf28aa       350MB
nvcr.io/nvidia/kubevirt-gpu-device-plugin                       v1.2.9                         3b4407d30d0d6       415MB
registry.k8s.io/nfd/node-feature-discovery                      v0.16.3                        bc292d823f05c       226MB

error information from the driver-validation pod

2024-10-24T09:25:48.176690024+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:48.178993123+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:48.179246578+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:53.179544295+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:53.181845219+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:53.182124683+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:58.182391151+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:58.184547885+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:58.184862084+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:03.185091032+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:03.187436039+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:03.187669887+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:08.187985477+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:08.190198327+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:08.190459947+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:13.190717358+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:13.192869694+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:13.193131254+08:00 command failed, retrying after 5 seconds
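
The validator's check is literally chroot /run/nvidia/driver nvidia-smi: it treats /run/nvidia/driver as a complete driver root filesystem and runs nvidia-smi inside it, so "No such file or directory" means no nvidia-smi exists under that directory (the listing further below confirms it is nearly empty). The same failure reproduces straight from the host shell:

# chroot /run/nvidia/driver nvidia-smi
chroot: failed to run command 'nvidia-smi': No such file or directory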

Kubernetes event information

# kubectl get event -n gpu-operator
LAST SEEN   TYPE     REASON              OBJECT                                                             MESSAGE
12m         Normal   LeaderElection      lease/53822513.nvidia.com                                          gpu-operator-7d66589d9b-rkqrm_4bb2ab72-3969-4020-87e4-1704ded2e72d became leader
12m         Normal   Scheduled           pod/gpu-feature-discovery-z4v8q                                    Successfully assigned gpu-operator/gpu-feature-discovery-z4v8q to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-feature-discovery-z4v8q                                    Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/gpu-feature-discovery-z4v8q                                    Created container toolkit-validation
12m         Normal   Started             pod/gpu-feature-discovery-z4v8q                                    Started container toolkit-validation
12m         Normal   SuccessfulCreate    daemonset/gpu-feature-discovery                                    Created pod: gpu-feature-discovery-z4v8q
12m         Normal   Scheduled           pod/gpu-operator-7d66589d9b-rkqrm                                  Successfully assigned gpu-operator/gpu-operator-7d66589d9b-rkqrm to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-7d66589d9b-rkqrm                                  Container image "nvcr.io/nvidia/gpu-operator:v24.6.2" already present on machine
12m         Normal   Created             pod/gpu-operator-7d66589d9b-rkqrm                                  Created container gpu-operator
12m         Normal   Started             pod/gpu-operator-7d66589d9b-rkqrm                                  Started container gpu-operator
12m         Normal   SuccessfulCreate    replicaset/gpu-operator-7d66589d9b                                 Created pod: gpu-operator-7d66589d9b-rkqrm
12m         Normal   Scheduled           pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m         Normal   Created             pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Created container gc
12m         Normal   Started             pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Started container gc
12m         Normal   SuccessfulCreate    replicaset/gpu-operator-node-feature-discovery-gc-7478549676       Created pod: gpu-operator-node-feature-discovery-gc-7478549676-zzr66
12m         Normal   ScalingReplicaSet   deployment/gpu-operator-node-feature-discovery-gc                  Scaled up replica set gpu-operator-node-feature-discovery-gc-7478549676 to 1
12m         Normal   Scheduled           pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m         Normal   Created             pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Created container master
12m         Normal   Started             pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Started container master
12m         Normal   SuccessfulCreate    replicaset/gpu-operator-node-feature-discovery-master-67769784f5   Created pod: gpu-operator-node-feature-discovery-master-67769784f5-7pqvj
12m         Normal   ScalingReplicaSet   deployment/gpu-operator-node-feature-discovery-master              Scaled up replica set gpu-operator-node-feature-discovery-master-67769784f5 to 1
12m         Normal   Scheduled           pod/gpu-operator-node-feature-discovery-worker-6bwfw               Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-6bwfw to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-node-feature-discovery-worker-6bwfw               Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m         Normal   Created             pod/gpu-operator-node-feature-discovery-worker-6bwfw               Created container worker
12m         Normal   Started             pod/gpu-operator-node-feature-discovery-worker-6bwfw               Started container worker
12m         Normal   SuccessfulCreate    daemonset/gpu-operator-node-feature-discovery-worker               Created pod: gpu-operator-node-feature-discovery-worker-6bwfw
12m         Normal   ScalingReplicaSet   deployment/gpu-operator                                            Scaled up replica set gpu-operator-7d66589d9b to 1
12m         Normal   Scheduled           pod/nvidia-container-toolkit-daemonset-lkqz9                       Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-lkqz9 to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-container-toolkit-daemonset-lkqz9                       Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-container-toolkit-daemonset-lkqz9                       Created container driver-validation
12m         Normal   Started             pod/nvidia-container-toolkit-daemonset-lkqz9                       Started container driver-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-container-toolkit-daemonset                       Created pod: nvidia-container-toolkit-daemonset-lkqz9
12m         Normal   Scheduled           pod/nvidia-dcgm-exporter-lxjqv                                     Successfully assigned gpu-operator/nvidia-dcgm-exporter-lxjqv to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-dcgm-exporter-lxjqv                                     Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-dcgm-exporter-lxjqv                                     Created container toolkit-validation
12m         Normal   Started             pod/nvidia-dcgm-exporter-lxjqv                                     Started container toolkit-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-dcgm-exporter                                     Created pod: nvidia-dcgm-exporter-lxjqv
12m         Normal   Scheduled           pod/nvidia-device-plugin-daemonset-5tncc                           Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-5tncc to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-device-plugin-daemonset-5tncc                           Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-device-plugin-daemonset-5tncc                           Created container toolkit-validation
12m         Normal   Started             pod/nvidia-device-plugin-daemonset-5tncc                           Started container toolkit-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-device-plugin-daemonset                           Created pod: nvidia-device-plugin-daemonset-5tncc
12m         Normal   Scheduled           pod/nvidia-driver-daemonset-vpvv6                                  Successfully assigned gpu-operator/nvidia-driver-daemonset-vpvv6 to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-driver-daemonset-vpvv6                                  Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10" already present on machine
12m         Normal   Created             pod/nvidia-driver-daemonset-vpvv6                                  Created container k8s-driver-manager
12m         Normal   Started             pod/nvidia-driver-daemonset-vpvv6                                  Started container k8s-driver-manager
12m         Normal   Killing             pod/nvidia-driver-daemonset-vpvv6                                  Stopping container k8s-driver-manager
12m         Normal   SuccessfulCreate    daemonset/nvidia-driver-daemonset                                  Created pod: nvidia-driver-daemonset-vpvv6
12m         Normal   SuccessfulDelete    daemonset/nvidia-driver-daemonset                                  Deleted pod: nvidia-driver-daemonset-vpvv6
12m         Normal   Scheduled           pod/nvidia-operator-validator-jctzl                                Successfully assigned gpu-operator/nvidia-operator-validator-jctzl to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-operator-validator-jctzl                                Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-operator-validator-jctzl                                Created container driver-validation
12m         Normal   Started             pod/nvidia-operator-validator-jctzl                                Started container driver-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-operator-validator                                Created pod: nvidia-operator-validator-jctzl
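
The telling events are those for nvidia-driver-daemonset-vpvv6: only the k8s-driver-manager container ever started before the pod was stopped, and the daemonset logged SuccessfulCreate immediately followed by SuccessfulDelete, matching the DESIRED 0 state seen earlier. Hedged commands to see why the operator withdrew the driver pod (assuming the chart created the usual ClusterPolicy resource):

# kubectl get clusterpolicies.nvidia.com -o jsonpath='{.items[0].spec.driver.enabled}'
# kubectl -n gpu-operator logs deploy/gpu-operator | grep -i driver | tail -n 20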

contents of /run/nvidia/

# ls /run/nvidia/
driver  mps  toolkit  validations
# ls /run/nvidia/driver/
lib
# ls /run/nvidia/driver/lib/
firmware
# ls /run/nvidia/driver/lib/firmware/
# ls /run/nvidia/mps/
# ls /run/nvidia/toolkit/
# ls /run/nvidia/validations/
#

question: how do I set up /run/nvidia/driver/ so that it contains nvidia-smi and the other driver files?
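
For context, a hedged note: /run/nvidia/driver only gets populated with a full driver root (bin/, lib/, nvidia-smi, and so on) by the operator's own driver container; on a node with a preinstalled host driver it stays essentially empty, exactly as shown above. That suggests two ways out, depending on intent: either let the operator manage the driver (driver.enabled=true, with the nvidia-driver-daemonset actually coming up and mounting its installation under /run/nvidia/driver), or keep the preinstalled 560.35.03 host driver and tell the chart not to expect a containerized one, in which case the validator should validate the host driver instead. A sketch of the latter (assumption: values.yaml does not already set this):

# helm upgrade --kubeconfig=/var/lib/secctr/k3s/server/cred/admin.kubeconfig \
    gpu-operator -n gpu-operator . --values values.yaml --set driver.enabled=false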

vanloswang · Oct 24 '24 13:10