
Why is there no allocatable GPU resource on a GPU cloud instance?

Open shizhouhu opened this issue 1 year ago • 5 comments

When I describe the node, no GPU resource is listed under Capacity or Allocatable. Why?

Capacity:
  cpu:                48
  ephemeral-storage:  574137520Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  529125137556
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263501320Ki
  pods:               110

(this is the node description)
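For anyone checking the same thing programmatically: the NVIDIA device plugin advertises GPUs under the resource name `nvidia.com/gpu`, so that key should appear in both `status.capacity` and `status.allocatable` of the node object once the plugin has registered. A minimal Python sketch, assuming the node JSON comes from `kubectl get node <node-name> -o json`; the helper name `gpu_counts` and the trimmed `SAMPLE_NODE` (modelled on the output above) are illustrative only:

```python
import json

# Resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"

# Trimmed, hypothetical node object modelled on the description above:
# CPU and pods are present, but no GPU resource is advertised.
SAMPLE_NODE = """
{
  "status": {
    "capacity":    {"cpu": "48", "pods": "110"},
    "allocatable": {"cpu": "48", "pods": "110"}
  }
}
"""


def gpu_counts(node_json: str) -> dict:
    """Return the GPU count in capacity and allocatable (0 if the key is absent)."""
    status = json.loads(node_json).get("status", {})
    return {
        section: int(status.get(section, {}).get(GPU_RESOURCE, 0))
        for section in ("capacity", "allocatable")
    }


print(gpu_counts(SAMPLE_NODE))
```

A count of 0 in both sections matches the symptom in this issue: the kubelet never received a GPU registration from the plugin, even though the plugin pod may be Running.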

  1. I have installed the NVIDIA driver
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       Off | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(this is the nvidia-smi output for the Tesla P4 GPUs)

  2. I have installed the NVIDIA Container Toolkit and configured it for the containerd runtime
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

(this is the containerd config for nvidia container runtime)

  3. I have installed the NVIDIA device plugin for Kubernetes (nvidia-device-plugin)

NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-x2pzs                     1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-2k9mg                  1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-nr6tz                  1/1     Running   2 (16h ago)   7d18h
kube-system    etcd-ubuntu-2288h-v5                      1/1     Running   3 (16h ago)   7d18h
kube-system    kube-apiserver-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    kube-controller-manager-ubuntu-2288h-v5   1/1     Running   3 (16h ago)   7d18h
kube-system    kube-proxy-p6gk9                          1/1     Running   2 (16h ago)   7d18h
kube-system    kube-scheduler-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    metrics-server-6875467c8d-k6sd6           1/1     Running   2 (16h ago)   2d15h
kube-system    nvidia-device-plugin-daemonset-57kxg      1/1     Running   0             10h

(this is the nvidia device plugin for k8s)

Does anyone know what the problem is? Thanks.

shizhouhu avatar Jul 19 '24 10:07 shizhouhu

Having the same problem

jaffe-fly avatar Jul 24 '24 09:07 jaffe-fly

You need to install GFD (GPU Feature Discovery) or label your node.

jaffe-fly avatar Aug 01 '24 13:08 jaffe-fly

Add the --set-as-default parameter when generating the containerd config:

nvidia-ctk runtime configure --runtime=containerd --set-as-default

Bugaoxingxx avatar Aug 27 '24 13:08 Bugaoxingxx

You need to install GFD (GPU Feature Discovery) or label your node.

Thanks, will try.

shizhouhu avatar Sep 17 '24 05:09 shizhouhu

Add the --set-as-default parameter when generating the containerd config:

nvidia-ctk runtime configure --runtime=containerd --set-as-default

Thanks.

shizhouhu avatar Sep 17 '24 05:09 shizhouhu

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Dec 17 '24 04:12 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Jan 17 '25 04:01 github-actions[bot]