Feature Request: GPU health check
The official NVIDIA k8s-device-plugin supports GPU health monitoring, so that a GPU reporting an Xid error becomes unusable and won't get assigned to a pod. Is gpu-manager going to support this feature? If not, is the implementation feasible under the current framework? Thanks.
I'm working on refactoring gpu-manager, and will consider including this feature in the refactor.
When will the refactor be completed? Looking forward to it!