k8s-device-plugin: Why is there no allocatable GPU resource on a GPU cloud instance?

When I run kubectl describe node, no GPU resource (nvidia.com/gpu) is listed under Capacity or Allocatable. Why?

Capacity:
  cpu:                48
  ephemeral-storage:  574137520Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  529125137556
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263501320Ki
  pods:               110

(this is the node description)
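A quick way to check for the GPU resource without reading the whole describe output: a small shell sketch that pulls the nvidia.com/gpu value out of the Allocatable section. gpu_allocatable is a hypothetical helper name, and the node name ubuntu-2288h-v5 is an assumption inferred from the static pod names (etcd-ubuntu-2288h-v5 etc.) in the pod listing. On this node it would print 0, matching the output above.

```shell
#!/usr/bin/env bash
# gpu_allocatable: read `kubectl describe node` output on stdin and print the
# nvidia.com/gpu value from the Allocatable section, or 0 if it is absent.
gpu_allocatable() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }  # start of the Allocatable block
    /^[^ ]/         { in_alloc = 0 }        # any unindented line ends it
    in_alloc && $1 == "nvidia.com/gpu:" { print $2; found = 1 }
    END { if (!found) print 0 }
  '
}

# Usage (node name assumed from the static pod names in this cluster):
#   kubectl describe node ubuntu-2288h-v5 | gpu_allocatable
```

Only the Allocatable section is read, because that is what the scheduler uses; Capacity can list the resource too.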

1. I have installed the NVIDIA driver:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       Off | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(this is the nvidia-smi output: driver 535.183.01 with four Tesla P4 GPUs)
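The host sees four GPUs, so once the device plugin works the node should advertise nvidia.com/gpu: 4. A minimal sketch for getting that number to compare against the node's Allocatable value; count_host_gpus is a hypothetical helper, and it relies on nvidia-smi's CSV query mode printing one line per GPU.

```shell
#!/usr/bin/env bash
# count_host_gpus: count the GPUs in the output of
#   nvidia-smi --query-gpu=index,name --format=csv,noheader
# which prints one line per GPU (e.g. "0, Tesla P4"). Blank lines are ignored.
count_host_gpus() {
  grep -c '[^[:space:]]' || true  # grep -c still prints 0 when nothing matches
}

# Usage:
#   nvidia-smi --query-gpu=index,name --format=csv,noheader | count_host_gpus
```

With the four Tesla P4 cards above this should print 4.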

2. I have installed the NVIDIA Container Toolkit and configured the nvidia runtime for containerd:

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

(this is the containerd config for the NVIDIA container runtime)
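Note that the fragment above only registers an additional runtime named nvidia. Containers go through nvidia-container-runtime only if it is the default runtime (default_runtime_name) or if a pod selects it via a RuntimeClass; otherwise the device plugin pod runs under plain runc and, as far as I understand, cannot load the driver's NVML library, so it advertises no GPUs. A sketch for checking which runtime containerd actually uses by default; default_runtime is a hypothetical helper.

```shell
#!/usr/bin/env bash
# default_runtime: read `containerd config dump` output on stdin and print the
# value of the first default_runtime_name key (for example "runc" or "nvidia").
default_runtime() {
  awk -F'"' '/^[[:space:]]*default_runtime_name[[:space:]]*=/ { print $2; exit }'
}

# Usage (sudo may be needed to read the config):
#   sudo containerd config dump | default_runtime
# If this prints "runc", pods without a RuntimeClass (including the device
# plugin) do not go through nvidia-container-runtime.
```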

3. I have installed the NVIDIA device plugin for Kubernetes (nvidia-device-plugin):

NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-x2pzs                     1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-2k9mg                  1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-nr6tz                  1/1     Running   2 (16h ago)   7d18h
kube-system    etcd-ubuntu-2288h-v5                      1/1     Running   3 (16h ago)   7d18h
kube-system    kube-apiserver-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    kube-controller-manager-ubuntu-2288h-v5   1/1     Running   3 (16h ago)   7d18h
kube-system    kube-proxy-p6gk9                          1/1     Running   2 (16h ago)   7d18h
kube-system    kube-scheduler-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    metrics-server-6875467c8d-k6sd6           1/1     Running   2 (16h ago)   2d15h
kube-system    nvidia-device-plugin-daemonset-57kxg      1/1     Running   0             10h

(the nvidia-device-plugin DaemonSet pod is running in kube-system)
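A Running status does not necessarily mean the plugin found any GPUs; its logs usually say why it did not. A sketch that filters the logs down to lines mentioning NVML (the NVIDIA management library the plugin uses to enumerate GPUs) or errors. nvml_lines is a hypothetical helper, and the pattern is only a starting point because the exact log wording differs between plugin versions.

```shell
#!/usr/bin/env bash
# nvml_lines: filter device plugin logs (stdin) down to lines that mention
# NVML or look like errors/failures. Case-insensitive; exits non-zero if
# nothing matches.
nvml_lines() {
  grep -iE 'nvml|error|fail'
}

# Usage (pod name from the listing above):
#   kubectl -n kube-system logs nvidia-device-plugin-daemonset-57kxg | nvml_lines
```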

Does anyone know what the problem is? Thanks.

Jul 19 '24 10:07 shizhouhu

I'm having the same problem.

Jul 24 '24 09:07 jaffe-fly

You need to install GFD (GPU Feature Discovery) or label your node.
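For the labelling route, a sketch that checks for the nvidia.com/gpu.present=true label and adds it if missing. Which label actually matters depends on how the plugin was deployed: as far as I know, the Helm chart's default node affinity matches nodes labelled by NFD/GFD (e.g. feature.node.kubernetes.io/pci-10de.present=true) or with nvidia.com/gpu.present=true, while the plain static DaemonSet manifest has no such affinity. Since the plugin pod in the question is already Running, a missing label may not be the cause there. has_gpu_label is a hypothetical helper and the node name is assumed.

```shell
#!/usr/bin/env bash
# has_gpu_label: read `kubectl get node NAME --show-labels --no-headers`
# output on stdin; succeed if the comma-separated LABELS column (the last
# column) contains nvidia.com/gpu.present=true.
has_gpu_label() {
  awk '
    {
      n = split($NF, kv, ",")
      for (i = 1; i <= n; i++)
        if (kv[i] == "nvidia.com/gpu.present=true") found = 1
    }
    END { exit !found }
  '
}

# Usage (node name assumed):
#   kubectl get node ubuntu-2288h-v5 --show-labels --no-headers | has_gpu_label \
#     || kubectl label node ubuntu-2288h-v5 nvidia.com/gpu.present=true
```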

Aug 01 '24 13:08 jaffe-fly

Add the --set-as-default parameter when generating the containerd config:

nvidia-ctk runtime configure --runtime=containerd --set-as-default
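After running that command, containerd and the device plugin still need a restart to pick up the change. A sketch of the full sequence, assuming containerd is managed by systemd and that the DaemonSet is named nvidia-device-plugin-daemonset (inferred from the pod name in the question); run and apply_nvidia_default_runtime are hypothetical helpers, and DRY_RUN=1 only prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# apply_nvidia_default_runtime: make nvidia the default containerd runtime,
# restart containerd so it loads the new config, then restart the device
# plugin DaemonSet so its pods are recreated under the nvidia runtime.
# Optional $1: DaemonSet name (the default below is an assumption).
apply_nvidia_default_runtime() {
  local ds="${1:-nvidia-device-plugin-daemonset}"
  run sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
  run sudo systemctl restart containerd
  run kubectl -n kube-system rollout restart "daemonset/$ds"
}

# Usage:
#   DRY_RUN=1 apply_nvidia_default_runtime   # preview
#   apply_nvidia_default_runtime             # apply
```

Once the new plugin pod is up, kubectl describe node should list nvidia.com/gpu under Capacity and Allocatable.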

Aug 27 '24 13:08 Bugaoxingxx

> you need install GFD or label you node

Thanks, will try.

Sep 17 '24 05:09 shizhouhu

> add parameter while generate containerd config
>
> nvidia-ctk runtime configure --runtime=containerd --set-as-default

Thanks.

Sep 17 '24 05:09 shizhouhu

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

Dec 17 '24 04:12 github-actions[bot]

This issue was automatically closed due to inactivity.

Jan 17 '25 04:01 github-actions[bot]