Feature Request: GPU health check

Open git861 opened this issue 4 years ago • 2 comments

The official NVIDIA k8s-device-plugin supports GPU health monitoring, so that a GPU reporting an Xid error becomes unusable and won't be assigned to a pod. Is gpu-manager going to support this feature? If not, would the implementation be feasible under the current framework? Thanks.
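
To make the requested behavior concrete, here is a minimal sketch of the idea, not gpu-manager's or the NVIDIA plugin's actual code. The names `Device`, `XidEvent`, `apply_xid_events`, and `allocatable` are hypothetical, and the set of Xid codes to ignore is an assumption left to the caller. Only the `Healthy` / `Unhealthy` strings follow the health values used by the Kubernetes device plugin API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set

# Health strings as used by the Kubernetes device plugin API.
HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"


@dataclass
class Device:
    """Hypothetical record for one GPU tracked by a device plugin."""
    uuid: str
    health: str = HEALTHY


@dataclass(frozen=True)
class XidEvent:
    """Hypothetical Xid error report for the GPU identified by uuid."""
    uuid: str
    xid: int


def apply_xid_events(
    devices: Dict[str, Device],
    events: Iterable[XidEvent],
    ignored_xids: Set[int],
) -> None:
    """Mark every device that reported a non-ignored Xid as unhealthy."""
    for event in events:
        if event.xid in ignored_xids:
            # Assumption: the caller decides which Xid codes are harmless.
            continue
        device = devices.get(event.uuid)
        if device is not None:
            device.health = UNHEALTHY


def allocatable(devices: Dict[str, Device]) -> List[str]:
    """Return the UUIDs of devices that may still be assigned to pods."""
    return sorted(d.uuid for d in devices.values() if d.health == HEALTHY)


if __name__ == "__main__":
    gpus = {u: Device(u) for u in ("GPU-a", "GPU-b")}
    reports = [XidEvent("GPU-a", 31), XidEvent("GPU-b", 79)]
    apply_xid_events(gpus, reports, ignored_xids={31})
    print(allocatable(gpus))
```

In a real plugin the events would come from the driver (for example through NVML event notifications) and the updated health would be reported back to the kubelet, so that unhealthy GPUs drop out of the node's allocatable set.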

git861, Jun 25 '21 04:06

I'm working on refactoring gpu-manager, and will consider including this feature in the refactor.

mYmNeo, Jun 28 '21 00:06

When will the refactor be completed? Looking forward to it!

fighterhit, Sep 29 '21 08:09