Pods needing GPU or SR-IOV fail scheduling after node restart

Open cicyle opened this issue 3 months ago • 0 comments

After a node restart, some pods that need a GPU or a VF may fail because the device resources are not yet ready. The failing pods then have to be deleted manually. For example, a failing pod can report an error such as: Message: Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices openshift.io/media_a_rx_pool, which is unexpected

As one solution, we could implement a DaemonSet that deletes those pods once every GPU/VF node has at least one allocatable device. I've done this
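
To make the proposal concrete, here is a minimal sketch of the decision logic only, not the implementation mentioned above. It assumes nodes and pods are plain dicts shaped like the Kubernetes API JSON, that the caller has already selected the GPU/VF nodes (for example by a label), and that rejected pods are in phase `Failed` with a message containing `Allocate failed`, as in the error quoted earlier. The function names and the `nvidia.com/gpu` entry in `WATCHED_RESOURCES` are illustrative assumptions.

```python
# Hypothetical list of extended resources to watch. Only
# openshift.io/media_a_rx_pool comes from the error above; the GPU
# resource name is an assumption.
WATCHED_RESOURCES = ("nvidia.com/gpu", "openshift.io/media_a_rx_pool")


def allocatable_count(node, resource):
    """Return the node's allocatable count for one resource (0 if absent)."""
    value = node.get("status", {}).get("allocatable", {}).get(resource, "0")
    try:
        return int(value)  # extended resources are whole numbers, sent as strings
    except (TypeError, ValueError):
        return 0


def devices_ready(device_nodes, resources=WATCHED_RESOURCES):
    """True once every GPU/VF node has at least one allocatable watched device."""
    if not device_nodes:
        return False
    return all(
        any(allocatable_count(node, res) >= 1 for res in resources)
        for node in device_nodes
    )


def is_rejected_for_devices(pod):
    """True for pods that failed with a device allocation error."""
    status = pod.get("status", {})
    message = status.get("message") or ""
    return status.get("phase") == "Failed" and "Allocate failed" in message


def pods_to_delete(device_nodes, pods, resources=WATCHED_RESOURCES):
    """Return (namespace, name) of rejected pods, but only once devices are ready."""
    if not devices_ready(device_nodes, resources):
        return []  # deleting now would likely recreate pods that fail again
    return [
        (pod["metadata"]["namespace"], pod["metadata"]["name"])
        for pod in pods
        if is_rejected_for_devices(pod)
    ]
```

A real DaemonSet would feed this from the API server and issue the deletes itself. Deleting only helps for pods owned by a controller (Deployment, StatefulSet, and so on) that recreates them.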

Oct 14 '25 09:10 cicyle