Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected
We maintain a k8s cluster with multiple nodes that each have 4 NVIDIA GPUs. Occasionally, one of the GPUs crashes. While that's unfortunate, the main issue is that a single GPU crashing causes the 3 other GPUs to become unallocatable. All pods scheduled on the node then fail to start with the following error:
Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected
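A possible workaround, sketched below, is a small watchdog that runs on each GPU node and cordons the node as soon as `nvidia-smi` reports a lost GPU, so the scheduler stops placing GPU pods there until the node is rebooted. This is only a sketch of what we have in mind, not something the plugin provides: the `NODE_NAME` environment variable (injected via the downward API), the 60-second poll interval, and the RBAC permission to patch Node objects are all assumptions about our setup.

```python
# Sketch of a node watchdog (assumed to run as a DaemonSet with RBAC
# permission to patch Node objects). It polls nvidia-smi and cordons the
# node once "GPU is lost" shows up, so no new GPU pods land there until
# the node is rebooted.
import os
import subprocess
import time

from kubernetes import client, config


def gpu_is_lost() -> bool:
    # Heuristic: nvidia-smi fails and prints "GPU is lost" when a GPU has
    # fallen off the bus; only a reboot recovers it.
    result = subprocess.run(["nvidia-smi", "-a"], capture_output=True, text=True)
    return result.returncode != 0 and "GPU is lost" in (result.stdout + result.stderr)


def main() -> None:
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    node_name = os.environ["NODE_NAME"]  # assumed to be injected via the downward API

    while True:
        if gpu_is_lost():
            # Cordon the node; rebooting it to recover the GPU stays a manual step.
            v1.patch_node(node_name, {"spec": {"unschedulable": True}})
        time.sleep(60)


if __name__ == "__main__":
    main()
```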
Also, our application that uses those GPUs is managed by a Deployment. When a GPU crashes, the Deployment keeps creating replacement Pods without removing the previous Failed ones, so Failed Pods accumulate (we saw up to 12k) and slow down the entire cluster.
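As a stopgap for that Pod buildup, something like the cleanup script below can delete Pods stuck in the Failed phase (lowering kube-controller-manager's `--terminated-pod-gc-threshold` would be the built-in alternative). The namespace and label selector here are placeholders for our setup, not values taken from the plugin.

```python
# Sketch of a cleanup job: delete Pods stuck in the Failed phase so they do
# not pile up on the API server. Namespace and label selector are
# placeholders for our deployment.
from kubernetes import client, config


def delete_failed_pods(namespace: str = "default",
                       label_selector: str = "app=gpu-workload") -> int:
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()

    failed = v1.list_namespaced_pod(
        namespace,
        field_selector="status.phase=Failed",
        label_selector=label_selector,
    )
    for pod in failed.items:
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
    return len(failed.items)


if __name__ == "__main__":
    print(f"deleted {delete_failed_pods()} failed pods")
```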
In the DaemonSet config, we already set `--fail-on-init-error=false`.
Common error checking:
- The output of `nvidia-smi -a` on your host: `Unable to determine the device handle for GPU 0000:C1:00.0: GPU is lost. Reboot the system to recover this GPU`
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from `docker version`: 20.10.2
- Kernel version from `uname -a`: Linux node-11 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`: 450.102.04-0ubuntu0.20.04.1
- NVIDIA container library version from `nvidia-container-cli -V`: 1.3.1
I have the same error.