Why is no GPU resource allocatable on a GPU cloud instance?
When I describe the node, no GPU resource is listed. Why?
Capacity:
  cpu:                48
  ephemeral-storage:  574137520Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  529125137556
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263501320Ki
  pods:               110
(this is the node description)
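For anyone comparing against their own node: the NVIDIA device plugin advertises GPUs as the extended resource `nvidia.com/gpu`, so that is the line missing from the output above. A minimal sketch of a check, assuming you pipe `kubectl describe node` output into it; the helper name `gpu_resource_lines` is made up for illustration and it only filters text on stdin:

```shell
# Hypothetical helper (not part of kubectl): reads `kubectl describe node`
# output on stdin and prints any nvidia.com/gpu resource lines, or a short
# note when the resource is not advertised at all.
gpu_resource_lines() {
  matches=$(grep -F 'nvidia.com/gpu:' || true)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
  else
    echo "no nvidia.com/gpu resource advertised"
  fi
}

# Usage (replace <node-name> with your node):
#   kubectl describe node <node-name> | gpu_resource_lines
```

On a working node this should print one line under Capacity and one under Allocatable.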
- I have installed the NVIDIA driver:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       Off | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
(this is the nvidia-smi output for the Tesla P4 cards)
- I have installed the NVIDIA Container Toolkit and configured it for the containerd runtime:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
(this is the containerd config for nvidia container runtime)
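One thing the snippet above does not show is which runtime containerd uses by default. As far as I understand the device plugin prerequisites, the plugin pod itself has to run under the NVIDIA runtime to see the GPUs, which usually means `default_runtime_name = "nvidia"` in the CRI section (or a RuntimeClass on the plugin pod). A rough sketch for checking that; the helper name `containerd_default_runtime` is hypothetical and it just parses config text on stdin:

```shell
# Hypothetical helper: reads a containerd config (TOML) on stdin and prints
# the value of the first default_runtime_name entry, or "unset" if the key
# does not appear in the input.
containerd_default_runtime() {
  name=$(sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' | head -n 1)
  echo "${name:-unset}"
}

# Usage (the path is the usual location; adjust if yours differs):
#   containerd_default_runtime < /etc/containerd/config.toml
#   containerd config dump | containerd_default_runtime
```

If this prints `runc` or `unset` for the config file, the `nvidia` runtime is registered but not the default.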
- I have installed the NVIDIA Kubernetes device plugin, nvidia-device-plugin:
NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-x2pzs                     1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-2k9mg                  1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-nr6tz                  1/1     Running   2 (16h ago)   7d18h
kube-system    etcd-ubuntu-2288h-v5                      1/1     Running   3 (16h ago)   7d18h
kube-system    kube-apiserver-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    kube-controller-manager-ubuntu-2288h-v5   1/1     Running   3 (16h ago)   7d18h
kube-system    kube-proxy-p6gk9                          1/1     Running   2 (16h ago)   7d18h
kube-system    kube-scheduler-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    metrics-server-6875467c8d-k6sd6           1/1     Running   2 (16h ago)   2d15h
kube-system    nvidia-device-plugin-daemonset-57kxg      1/1     Running   0             10h
(this is the pod list, including the NVIDIA device plugin for k8s)
Does anyone know what the problem is? Thanks.
I'm having the same problem.
You need to install GFD (GPU Feature Discovery) or label your node.
Add the --set-as-default parameter when generating the containerd config:
nvidia-ctk runtime configure --runtime=containerd --set-as-default
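In case it helps others landing here, the command above only rewrites the config file. containerd has to be restarted to pick it up, and the device plugin pod most likely needs to be recreated so it starts under the new default runtime. A sketch of the whole sequence, with a dry-run switch; the `run` wrapper, the function name, and the `DRY_RUN` / `PLUGIN_POD` variables are my own, and the pod name is the one from the listing in this issue:

```shell
# Pod name taken from the `kubectl get pods` listing in this issue; yours
# will differ, so override PLUGIN_POD as needed.
PLUGIN_POD="${PLUGIN_POD:-nvidia-device-plugin-daemonset-57kxg}"

# With DRY_RUN=1 the commands are only printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

set_nvidia_default_runtime() {
  # Rewrite the containerd config with nvidia as the default runtime.
  run sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
  # containerd reads its config at startup, so restart it.
  run sudo systemctl restart containerd
  # Delete the plugin pod; the DaemonSet recreates it under the new default.
  run kubectl -n kube-system delete pod "$PLUGIN_POD"
}

# Usage:
#   DRY_RUN=1 set_nvidia_default_runtime   # preview the commands
#   set_nvidia_default_runtime             # apply them
```

After the plugin pod is back in Running state, `kubectl describe node` should list `nvidia.com/gpu` under Capacity and Allocatable if this was the cause.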
You need to install GFD (GPU Feature Discovery) or label your node.
Thanks, will try.
Add the --set-as-default parameter when generating the containerd config:
nvidia-ctk runtime configure --runtime=containerd --set-as-default
Thanks.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.