Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected
We maintain a k8s cluster with multiple nodes that each have 4 NVIDIA GPUs. Occasionally, one of the GPUs crashes. While that's unfortunate, the main issue is that a single GPU crashing causes the 3 other GPUs to become unallocatable. All pods scheduled on the node then fail to start with the following error:
Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = Unable to retrieve list of available devices: error creating nvml.Device 3: nvml: GPU is lost, which is unexpected
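A possible workaround, sketched below, is a small watchdog that runs on each GPU node and cordons the node as soon as `nvidia-smi` reports a lost GPU, so the scheduler stops placing GPU pods there until the node is rebooted. This is only a sketch of what we have in mind, not something the plugin provides: the `NODE_NAME` environment variable (injected via the downward API), the 60-second poll interval, and the RBAC permission to patch Node objects are all assumptions about our setup.

```python
# Sketch of a node watchdog (assumed to run as a DaemonSet with RBAC
# permission to patch Node objects). It polls nvidia-smi and cordons the
# node once "GPU is lost" shows up, so no new GPU pods land there until
# the node is rebooted.
import os
import subprocess
import time

from kubernetes import client, config


def gpu_is_lost() -> bool:
    # Heuristic: nvidia-smi fails and prints "GPU is lost" when a GPU has
    # fallen off the bus; only a reboot recovers it.
    result = subprocess.run(["nvidia-smi", "-a"], capture_output=True, text=True)
    return result.returncode != 0 and "GPU is lost" in (result.stdout + result.stderr)


def main() -> None:
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    node_name = os.environ["NODE_NAME"]  # assumed to be injected via the downward API

    while True:
        if gpu_is_lost():
            # Cordon the node; rebooting it to recover the GPU stays a manual step.
            v1.patch_node(node_name, {"spec": {"unschedulable": True}})
        time.sleep(60)


if __name__ == "__main__":
    main()
```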
Also, our application that uses those GPUs is managed by a Deployment. When a GPU crashes, the Deployment keeps creating replacement Pods without removing the previous Failed ones, so Failed Pods accumulate (we saw up to 12k) and slow down the entire cluster.
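As a stopgap for that Pod buildup, something like the cleanup script below can delete Pods stuck in the Failed phase (lowering kube-controller-manager's `--terminated-pod-gc-threshold` would be the built-in alternative). The namespace and label selector here are placeholders for our setup, not values taken from the plugin.

```python
# Sketch of a cleanup job: delete Pods stuck in the Failed phase so they do
# not pile up on the API server. Namespace and label selector are
# placeholders for our deployment.
from kubernetes import client, config


def delete_failed_pods(namespace: str = "default",
                       label_selector: str = "app=gpu-workload") -> int:
    config.load_kube_config()  # use load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()

    failed = v1.list_namespaced_pod(
        namespace,
        field_selector="status.phase=Failed",
        label_selector=label_selector,
    )
    for pod in failed.items:
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
    return len(failed.items)


if __name__ == "__main__":
    print(f"deleted {delete_failed_pods()} failed pods")
```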
In the DaemonSet config, we already set `--fail-on-init-error=false`.
Common error checking:
- The output of `nvidia-smi -a` on your host: `Unable to determine the device handle for GPU 0000:C1:00.0: GPU is lost. Reboot the system to recover this GPU`
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from `docker version`: 20.10.2
- Kernel version from `uname -a`: Linux node-11 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`: 450.102.04-0ubuntu0.20.04.1
- NVIDIA container library version from `nvidia-container-cli -V`: 1.3.1
I have the same error.