CCM does not properly detect nodes that are powered off
Bug Reporting
CCM does not properly detect nodes that are powered off.
Expected Behavior
On shutdown of a Kubernetes node, the CCM detects that it is powered off and migrates workloads off of the node.
Actual Behavior
The node status becomes NotReady after a few minutes because kubelet stops responding, but pods still have a status of Running and do not get rescheduled.
Steps to Reproduce the Problem
- Shut down a Kubernetes node on a cluster running the CCM, wait 5 minutes
- Observe node status with `kubectl get nodes`; note that the down node has a status of `NotReady`
- Observe pod status with `kubectl get pods -A`; note that pods on the down node are not rescheduled
This could be the `pod-eviction-timeout` (https://kubernetes.io/docs/concepts/architecture/nodes/#condition) or taint-based eviction policies (https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions).
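One thing worth checking on the stuck pods: unless a pod declares tolerations of its own, the DefaultTolerationSeconds admission plugin injects not-ready/unreachable tolerations of 300 seconds, which lines up with pods still being `Running` after the five-minute wait in the repro. A sketch of what those injected tolerations look like, expressed with the `k8s.io/api/core/v1` types (the helper function name is mine, and 300s is the upstream default):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// defaultTolerations mirrors the tolerations the DefaultTolerationSeconds
// admission plugin injects when a pod sets none of its own: the pod stays
// bound to a NotReady or unreachable node for tolerationSeconds before
// taint-based eviction deletes it. Sketch only, not upstream source.
func defaultTolerations(seconds int64) []v1.Toleration {
	return []v1.Toleration{
		{
			Key:               "node.kubernetes.io/not-ready",
			Operator:          v1.TolerationOpExists,
			Effect:            v1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          v1.TolerationOpExists,
			Effect:            v1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
	}
}

func main() {
	for _, t := range defaultTolerations(300) {
		fmt.Printf("%s %s tolerated for %ds\n", t.Key, t.Effect, *t.TolerationSeconds)
	}
}
```

If the stuck pods carry tolerations without a `tolerationSeconds`, they will never be evicted by the not-ready taint, which would explain the observed behavior.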
I'm adding more context on this topic: non-graceful node shutdown. A small sketch of the taint the feature relies on follows these links.
- https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/
- https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2268-non-graceful-shutdown/README.md
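As I read the KEP, recovery is triggered by a human (or an external controller) confirming the node is really down and tainting it with `node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`, which unblocks forced pod deletion and volume detach. A minimal client-go sketch of applying that taint; the node name `worker-1` and the kubeconfig location are placeholders:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	node, err := clientset.CoreV1().Nodes().Get(ctx, "worker-1", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// The out-of-service taint tells the non-graceful shutdown machinery
	// that the node is confirmed down. Only apply it after verifying the
	// instance is actually powered off.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	if _, err := clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```

The equivalent one-liner is `kubectl taint nodes worker-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`.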
I don't see any discussion of how a cloud-provider implementation would be expected to signal the shutdown condition; perhaps that will come after the Alpha phase of the feature.
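For reference, the closest existing hook seems to be `InstanceShutdown` on the `InstancesV2` interface in k8s.io/cloud-provider: when it returns true, the CCM's node lifecycle controller applies the `node.cloudprovider.kubernetes.io/shutdown:NoSchedule` taint, which by itself does not evict pods that are already `Running`. A rough sketch of a provider-side implementation, where `fakePowerAPI` and its `PowerState` method are hypothetical stand-ins for a real cloud SDK:

```go
package mycloud

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// fakePowerAPI is a hypothetical stand-in for a real provider SDK; an
// actual implementation would call the cloud's instance API here.
type fakePowerAPI struct{}

func (fakePowerAPI) PowerState(ctx context.Context, providerID string) (string, error) {
	return "off", nil // stubbed result for illustration
}

type instancesV2 struct {
	api fakePowerAPI
}

// InstanceShutdown is part of the cloudprovider.InstancesV2 interface from
// k8s.io/cloud-provider. Returning true causes the CCM node lifecycle
// controller to add the node.cloudprovider.kubernetes.io/shutdown
// NoSchedule taint to the node.
func (i *instancesV2) InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error) {
	state, err := i.api.PowerState(ctx, node.Spec.ProviderID)
	if err != nil {
		return false, err
	}
	return state == "off" || state == "stopped", nil
}
```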