minikube cannot detect the GPUs
I used your image to create a container and installed minikube inside it. When I run `minikube start`, the `minikube` node doesn't detect any GPU. I am wondering how to fix this. For what it's worth, `nvidia-smi` works fine inside the container.
```bash
docker run --gpus 1 -it --privileged --name ElasticDL -d ghcr.io/ehfd/nvidia-dind:latest
docker exec -it ElasticDL /bin/bash
# install minikube
minikube start
alias kubectl="minikube kubectl --"
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```
The result shows the GPU count as `<none>`.
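(The same information can also be read from the node object directly; a minimal check, assuming the single node is named `minikube`:)

```bash
# Show the node's Capacity/Allocatable sections; a working setup would
# list nvidia.com/gpu there alongside cpu and memory.
kubectl describe node minikube | grep -A 8 -E 'Capacity|Allocatable'
```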
When I run a pod, its status is:

```
Warning  FailedScheduling  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
Perhaps you need k8s-device-plugin?
https://github.com/NVIDIA/k8s-device-plugin
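The "Enabling GPU Support in Kubernetes" section essentially boils down to deploying the plugin DaemonSet; a minimal sketch using the v0.14.1 static manifest (adjust the version as needed):

```bash
# Deploy the NVIDIA device plugin; the node only starts advertising
# nvidia.com/gpu after this DaemonSet pod registers with the kubelet.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```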
Thanks, but I have already tried this. I started from the section "Enabling GPU Support in Kubernetes"; I think this image already does the work before that section, but I am not sure if that's right.
```
root@440403c45a7b:/usr/src# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS       AGE
kube-system   coredns-5d78c9869d-wttcf               1/1     Running   0              2m36s
kube-system   etcd-minikube                          1/1     Running   0              5m27s
kube-system   kube-apiserver-minikube                1/1     Running   0              4m4s
kube-system   kube-controller-manager-minikube       1/1     Running   4 (4m1s ago)   5m23s
kube-system   kube-proxy-dv6gc                       1/1     Running   0              2m36s
kube-system   kube-scheduler-minikube                1/1     Running   0              4m15s
kube-system   nvidia-device-plugin-daemonset-lppbw   1/1     Running   0              2m27s
kube-system   storage-provisioner                    1/1     Running   1 (2m20s ago)  3m8s
```
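(Note: the DaemonSet pod being Running does not by itself mean it found any devices; its logs say what it actually detected. A quick check, assuming the `name=nvidia-device-plugin-ds` label set by the manifest:)

```bash
# If the plugin cannot see the GPUs, it typically logs an NVML init
# failure or "No devices found" while the pod itself stays Running.
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
```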
Then I ran `kubectl apply -f test-gpu.yaml`. The content of `test-gpu.yaml` is:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
The pod status is Pending; the details are:
```
root@440403c45a7b:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8d7kt (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-8d7kt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
```
I also tried to start from the "Preparing your GPU Nodes" section, but ran into difficulty with systemctl: I tried installing systemctl and running `systemctl restart docker`, but the container exits.
Perhaps https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configure-containerd or https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#configure-docker is where the problem lies.
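A quick way to tell whether that is the problem is to check whether the minikube node container can see the GPU at all; a rough check (the node runs as a Docker container named `minikube` inside your DinD container):

```bash
# If the NVIDIA runtime is wired up and passing the GPU through, this
# prints the usual nvidia-smi table; if not, it fails with "executable
# file not found" or reports no devices.
docker exec -it minikube nvidia-smi
```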
I am trying to implement these, but systemctl is not supported in this image, and I am confused about how to run `systemctl restart docker`. I tried several ways to install systemctl but still failed to restart Docker. Any suggestion is appreciated :>
`(sudo) supervisorctl restart dockerd`
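This image manages dockerd with supervisord rather than systemd, so supervisorctl is the counterpart of systemctl here; a minimal sketch:

```bash
supervisorctl status           # list the processes supervisord manages
supervisorctl restart dockerd  # the equivalent of `systemctl restart docker`
```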
Still doesn't work. Here is what I did:
- I restarted a container: `docker run --gpus 1 -it --privileged --name ElasticDL -d elasticdl:v1`. The image `elasticdl:v1` only adds `minikube`.
- Ran `docker exec -it ElasticDL /bin/bash`.
- Configured `/etc/docker/daemon.json` (see the sketch after this list). There is no file named `/etc/containerd/config.toml` and no service named `containerd`, so I skipped the containerd part. Then I restarted dockerd:
```
root@c0ac3df639d6:/usr/bin# supervisorctl restart dockerd
dockerd: stopped
dockerd: started
```
- Ran `minikube start`:
```
root@c0ac3df639d6:/usr/bin# minikube start --force
* minikube v1.31.2 on Ubuntu 22.04 (docker/amd64)
! minikube skips various validations when --force is supplied; this may lead to unexpected behavior
* Using the docker driver based on existing profile
* The "docker" driver should not be used with root privileges. If you wish to continue as root, use --force.
* If you are running minikube within a VM, consider using --driver=none:
*   https://minikube.sigs.k8s.io/docs/reference/drivers/none/
* Tip: To remove this root owned cluster, run: sudo minikube delete
* Starting control plane node minikube in cluster minikube
* Pulling base image ...
* Restarting existing docker container for "minikube" ...
* Preparing Kubernetes v1.27.4 on Docker 24.0.4 ...
* Configuring bridge CNI (Container Networking Interface) ...
  - Using image gcr.io/k8s-minikube/storage-provisioner:v5
* Verifying Kubernetes components...
* Enabled addons: default-storageclass, storage-provisioner
* kubectl not found. If you need it, try: 'minikube kubectl -- get pods -A'
* Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
```
- Ran `kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"`; the result is below. I am not sure if this is the cause, since the GPU column is `<none>`:
```
NAME       GPU
minikube   <none>
```
- Ran `kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml`.
- Applied `test-gpu.yaml`.
- Ran `kubectl get pod -A`. The pod `gpu-pod` status is the same:
```
root@c0ac3df639d6:/usr/bin# kubectl get pod -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS       AGE
default       gpu-pod                                0/1     Pending   0              11m
kube-system   coredns-5d78c9869d-r8x22               1/1     Running   1 (8m7s ago)   17m
kube-system   etcd-minikube                          1/1     Running   1 (8m11s ago)  19m
kube-system   kube-apiserver-minikube                1/1     Running   1 (8m11s ago)  18m
kube-system   kube-controller-manager-minikube       1/1     Running   6 (5m10s ago)  20m
kube-system   kube-proxy-8tl8c                       1/1     Running   1 (8m12s ago)  17m
kube-system   kube-scheduler-minikube                1/1     Running   1 (8m12s ago)  19m
kube-system   nvidia-device-plugin-daemonset-rbppm   1/1     Running   1              13m
kube-system   storage-provisioner                    1/1     Running   3 (4m53s ago)  17m
```
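For reference, the daemon.json step in the list above follows the README's "Configure docker" section; a sketch of what that amounts to in this image (the path `/usr/bin/nvidia-container-runtime` is the documented default and may differ):

```bash
# Make nvidia the default Docker runtime so every container created by
# this daemon (including the minikube node container) gets the GPU hooks.
cat > /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
# No systemd here, so restart dockerd through supervisord instead.
supervisorctl restart dockerd
```

Note that an existing container keeps the runtime it was created with, so `minikube delete && minikube start` (rather than just restarting the existing node container) is likely needed for the change to take effect.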
I also tried this document: https://github.com/intelligent-machine-learning/dlrover/blob/master/docs/tutorial/gpu_user_guide.md, which is similar to NVIDIA's document. Still the same result.
```
root@c0ac3df639d6:/usr/src# kubectl describe pod nvidia-device-plugin-daemonset-r9spv -n kube-system
Name:                 nvidia-device-plugin-daemonset-r9spv
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 minikube/192.168.49.2
Start Time:           Fri, 29 Sep 2023 12:27:34 +0000
Labels:               controller-revision-hash=586d67c5
                      name=nvidia-device-plugin-ds
                      pod-template-generation=2
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   10.244.0.9
IPs:
  IP:           10.244.0.9
Controlled By:  DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://e3994f9d249dbf33089fced497bb52ba8233f84b53bf4b76c72fc33cc58df1f2
    Image:          nvidia/k8s-device-plugin:1.11
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:41b3531d338477d26eb1151c15d0bea130d31e690752315a5205d8094439b0a6
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Fri, 29 Sep 2023 12:28:36 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      FAIL_ON_INIT_ERROR:  false
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcq6h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  kube-api-access-tcq6h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  6m43s  default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-r9spv to minikube
  Normal  Pulling    6m34s  kubelet            Pulling image "nvidia/k8s-device-plugin:1.11"
  Normal  Pulled     6m8s   kubelet            Successfully pulled image "nvidia/k8s-device-plugin:1.11" in 25.768299986s (25.768309412s including waiting)
  Normal  Created    5m43s  kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    5m40s  kubelet            Started container nvidia-device-plugin-ctr
```
```
root@c0ac3df639d6:/usr/src# kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bbdkm (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-bbdkm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  20s (x2 over 5m20s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
```
I've updated the NVIDIA Container Toolkit in the image. Please see if this solves anything.
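To confirm the updated toolkit inside a freshly pulled container, something like this should work (both CLIs ship with the toolkit):

```bash
nvidia-ctk --version            # NVIDIA Container Toolkit CLI
nvidia-container-cli --version  # libnvidia-container CLI
```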
Hello @darrenglow / CC @ehfd. How were you able to run Minikube inside a container? I have been trying for a long time, but I keep getting OCI and cgroup errors. Can you help me with this?
I have no idea... Perhaps try KinD?
Sure, I'll try it... Thanks!
https://www.substratus.ai/blog/kind-with-gpus/
https://github.com/kubernetes-sigs/kind/pull/3257#issuecomment-1607287275
Both actually look relevant/applicable here too.
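For what it's worth, the substratus.ai post boils down to roughly the following (a sketch, untested in this DinD setup; the config.toml key and the `/var/run/nvidia-container-devices/all` mount convention come from that post):

```bash
# Make nvidia the default Docker runtime on the host running KinD.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker  # in this image: supervisorctl restart dockerd

# Let the NVIDIA runtime accept GPU requests expressed as volume mounts.
sudo sed -i 's/^#\?\s*accept-nvidia-visible-devices-as-volume-mounts.*/accept-nvidia-visible-devices-as-volume-mounts = true/' \
  /etc/nvidia-container-runtime/config.toml

# Create a KinD cluster whose node requests all GPUs via that convention.
cat > kind-gpu.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev/null
    containerPath: /var/run/nvidia-container-devices/all
EOF
kind create cluster --config kind-gpu.yaml
```

The post then installs the NVIDIA GPU operator on top of that cluster.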
@ehfd I have tried it, but it didn't work :((
https://github.com/NVIDIA/k8s-device-plugin/issues/332#issuecomment-1917537232 One more resource.
Thank you, @ehfd. I've already explored that resource, but unfortunately it didn't work either. However, I've now switched to using virtual machines.