
minikube cannot detect the GPUs

Open andakai opened this issue 2 years ago • 15 comments

I used your image to create a container and installed minikube inside it. When I run minikube start, the minikube node doesn't detect any GPU. I am wondering how to fix this. By the way, nvidia-smi works fine inside the container.

docker run --gpus 1 -it --privileged --name ElasticDL -d ghcr.io/ehfd/nvidia-dind:latest
docker exec -it ElasticDL /bin/bash
# install minikube
minikube start
alias kubectl="minikube kubectl --"
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

The result then shows that the GPU count is <none>.

When I run a pod, the pod status is:

FailedScheduling
0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

andakai avatar Sep 29 '23 09:09 andakai

Perhaps you need k8s-device-plugin?

https://github.com/NVIDIA/k8s-device-plugin
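
For reference, deploying it is usually a single manifest apply, along these lines (the version tag is just an example; check the repository for the current release):

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Once the DaemonSet pod is running, the node should report nvidia.com/gpu under its allocatable resources.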

ehfd avatar Sep 29 '23 09:09 ehfd

Thanks, but I have already tried this. I started from the section "Enabling GPU Support in Kubernetes"; I assumed this image already covers the work described before that section, but I am not sure whether that is right.

root@440403c45a7b:/usr/src# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS        AGE
kube-system   coredns-5d78c9869d-wttcf               1/1     Running   0               2m36s
kube-system   etcd-minikube                          1/1     Running   0               5m27s
kube-system   kube-apiserver-minikube                1/1     Running   0               4m4s
kube-system   kube-controller-manager-minikube       1/1     Running   4 (4m1s ago)    5m23s
kube-system   kube-proxy-dv6gc                       1/1     Running   0               2m36s
kube-system   kube-scheduler-minikube                1/1     Running   0               4m15s
kube-system   nvidia-device-plugin-daemonset-lppbw   1/1     Running   0               2m27s
kube-system   storage-provisioner                    1/1     Running   1 (2m20s ago)   3m8s

Then I run kubectl apply -f test-gpu.yaml. The content of test-gpu.yaml is:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

The pod status is Pending; the details are:

root@440403c45a7b:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8d7kt (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-8d7kt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

I also tried to start from the "Preparing your GPU Nodes" section, but ran into difficulty with systemctl. I tried installing systemctl and then running systemctl restart docker, but the container exits.

andakai avatar Sep 29 '23 10:09 andakai

Perhaps https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configure-containerd or https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configure-docker is the problematic location.
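
For the Docker path, that section essentially amounts to making the NVIDIA runtime the default in /etc/docker/daemon.json, roughly like this (a sketch; the runtime path may differ inside the image):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

and then restarting the Docker daemon before minikube start, so the minikube node container gets created with the NVIDIA runtime.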

ehfd avatar Sep 29 '23 10:09 ehfd

I am trying to follow these, but systemctl is not supported in this image, so I am not sure how to run systemctl restart docker. I tried several ways to install systemctl but still could not restart Docker. Any suggestion is appreciated :>

andakai avatar Sep 29 '23 10:09 andakai

(sudo) supervisorctl restart dockerd
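
Since this image runs Docker under supervisord rather than systemd, the restart step from the NVIDIA instructions would look roughly like this (an untested sketch):

# after editing /etc/docker/daemon.json
sudo supervisorctl restart dockerd
# verify the NVIDIA runtime is registered with the daemon
docker info | grep -i runtime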

ehfd avatar Sep 29 '23 10:09 ehfd

Still doesn't work.

  1. I restart a container: docker run --gpus 1 -it --privileged --name ElasticDL -d elasticdl:v1. The image elasticdl:v1 only adds minikube.
  2. Run docker exec -it ElasticDL /bin/bash
  3. Configure /etc/docker/daemon.json. (There is no /etc/containerd/config.toml and no containerd service, so I skipped the containerd part.)
root@c0ac3df639d6:/usr/bin# supervisorctl restart dockerd
dockerd: stopped
dockerd: started
  4. Run minikube start:
root@c0ac3df639d6:/usr/bin# minikube start --force
* minikube v1.31.2 on Ubuntu 22.04 (docker/amd64)
! minikube skips various validations when --force is supplied; this may lead to unexpected behavior
* Using the docker driver based on existing profile
* The "docker" driver should not be used with root privileges. If you wish to continue as root, use --force.
* If you are running minikube within a VM, consider using --driver=none:
*   https://minikube.sigs.k8s.io/docs/reference/drivers/none/
* Tip: To remove this root owned cluster, run: sudo minikube delete
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Restarting existing docker container for "minikube" ...
* Preparing Kubernetes v1.27.4 on Docker 24.0.4 ...
* Configuring bridge CNI (Container Networking Interface) ...
  - Using image gcr.io/k8s-minikube/storage-provisioner:v5
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner
* kubectl not found. If you need it, try: 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
  5. Run kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu". The result is below; I am not sure whether this is the cause, since the GPU column is <none>:
NAME       GPU
minikube   <none>
  6. Run kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
  7. Apply test-gpu.yaml
  8. Run kubectl get pod -A. The gpu-pod status is the same:
root@c0ac3df639d6:/usr/bin# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS        AGE
default       gpu-pod                                0/1     Pending   0               11m
kube-system   coredns-5d78c9869d-r8x22               1/1     Running   1 (8m7s ago)    17m
kube-system   etcd-minikube                          1/1     Running   1 (8m11s ago)   19m
kube-system   kube-apiserver-minikube                1/1     Running   1 (8m11s ago)   18m
kube-system   kube-controller-manager-minikube       1/1     Running   6 (5m10s ago)   20m
kube-system   kube-proxy-8tl8c                       1/1     Running   1 (8m12s ago)   17m
kube-system   kube-scheduler-minikube                1/1     Running   1 (8m12s ago)   19m
kube-system   nvidia-device-plugin-daemonset-rbppm   1/1     Running   1               13m
kube-system   storage-provisioner                    1/1     Running   3 (4m53s ago)   17m
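
(For reference, the node's advertised resources and the device plugin's logs can be inspected with standard kubectl commands, e.g.:

kubectl describe node minikube | grep -A 8 Allocatable
kubectl logs -n kube-system nvidia-device-plugin-daemonset-rbppm

If the plugin logs report that it failed to initialize NVML or found no devices, the GPU is not reaching the minikube node container.)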

andakai avatar Sep 29 '23 11:09 andakai

I also tried this document: https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/tutorial/gpu_user_guide.md, which is similar to NVIDIA's documentation. I still get the same result.

root@c0ac3df639d6:/usr/src# kubectl describe pod nvidia-device-plugin-daemonset-r9spv -n kube-system
Name:                 nvidia-device-plugin-daemonset-r9spv
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 minikube/192.168.49.2
Start Time:           Fri, 29 Sep 2023 12:27:34 +0000
Labels:               controller-revision-hash=586d67c5
                      name=nvidia-device-plugin-ds
                      pod-template-generation=2
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.244.0.9
IPs:
  IP:           10.244.0.9
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://e3994f9d249dbf33089fced497bb52ba8233f84b53bf4b76c72fc33cc58df1f2
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:41b3531d338477d26eb1151c15d0bea130d31e690752315a5205d8094439b0a6
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 29 Sep 2023 12:28:36 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      FAIL_ON_INIT_ERROR:  false
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcq6h (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-tcq6h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  6m43s  default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-r9spv to minikube
  Normal  Pulling    6m34s  kubelet            Pulling image "nvidia/k8s-device-plugin:1.11"
  Normal  Pulled     6m8s   kubelet            Successfully pulled image "nvidia/k8s-device-plugin:1.11" in 25.768299986s (25.768309412s including waiting)
  Normal  Created    5m43s  kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    5m40s  kubelet            Started container nvidia-device-plugin-ctr
root@c0ac3df639d6:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bbdkm (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-bbdkm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  20s (x2 over 5m20s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

andakai avatar Sep 29 '23 12:09 andakai

I've updated the NVIDIA container toolkit. Please see if this might solve anything.
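
The bundled toolkit version can be checked inside the container with something like the following (assuming these binaries are present in the image):

nvidia-ctk --version
nvidia-container-cli --version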

ehfd avatar Nov 25 '23 10:11 ehfd

Hello @darrenglow, CC @ehfd. How were you able to run Minikube inside a container? I have been trying for a long time, but I keep getting OCI and cgroup errors. Can you help me with this?

rajat709 avatar Jan 01 '24 08:01 rajat709

I have no idea... Perhaps try KinD?

ehfd avatar Jan 05 '24 03:01 ehfd

Sure, I'll try it... Thanks.

rajat709 avatar Jan 05 '24 04:01 rajat709

https://www.substratus.ai/blog/kind-with-gpus/ https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275

Both actually look relevant/applicable here too.
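
If I recall the linked post correctly, the KinD route is roughly: make the NVIDIA runtime Docker's default, set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml, and create the cluster with a config along these lines (a sketch based on that post, not verified here):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
    # exposes all GPUs to the node container via the NVIDIA runtime hook
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

then install the k8s-device-plugin as usual.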

ehfd avatar Jan 09 '24 06:01 ehfd

@ehfd I have tried it, but it didn't work :((

rajat709 avatar Jan 30 '24 18:01 rajat709

https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1917537232 One more resource.

ehfd avatar Feb 25 '24 08:02 ehfd

Thank you, @ehfd. I had already explored that resource, but unfortunately it didn't work either. I've now switched to using virtual machines.

rajat709 avatar Feb 25 '24 13:02 rajat709