Working with descheduler
Hi there,
I'm using the descheduler to reschedule a pod to another node. However, the descheduler complains that every node has insufficient resource "telemetry/scheduling", which prevents the pod from being rescheduled. (I've checked the source code of the descheduler, and it only evicts pods that don't fit their current node but do fit some other node; see the code below from node_affinity.go)
pods, err := podutil.ListPodsOnANode(
    node.Name,
    getPodsAssignedToNode,
    podutil.WrapFilterFuncs(podFilter, func(pod *v1.Pod) bool {
        // Only pods that are evictable, do NOT fit their current node,
        // but DO fit at least one other node, pass the filter.
        return evictorFilter.Filter(pod) &&
            !nodeutil.PodFitsCurrentNode(getPodsAssignedToNode, pod, node) &&
            nodeutil.PodFitsAnyNode(getPodsAssignedToNode, pod, nodes)
    }),
)
if err != nil {
    klog.ErrorS(err, "Failed to get pods", "node", klog.KObj(node))
}
for _, pod := range pods {
    if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
        klog.V(1).InfoS("Evicting pod", "pod", klog.KObj(pod))
        if _, err := podEvictor.EvictPod(ctx, pod, node, "NodeAffinity"); err != nil {
            klog.ErrorS(err, "Error evicting pod")
            break
        }
    }
}
I'm using the setup from the health-metric-demo. The descheduler logs look like this:
I0628 14:32:11.838531 72888 node_affinity.go:78] "Processing node" node="minikube-m02"
I0628 14:32:11.838554 72888 node.go:183] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m02"
I0628 14:32:11.838557 72888 node.go:185] "insufficient telemetry/scheduling"
I0628 14:32:11.838568 72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube"
I0628 14:32:11.838571 72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.838579 72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m02"
I0628 14:32:11.838582 72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.838591 72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m03"
I0628 14:32:11.838619 72888 node.go:168] "pod node selector does not match the node label"
I0628 14:32:11.838624 72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.839395 72888 descheduler.go:312] "Number of evicted pods" totalEvicted=0
In your instructions for the health-demo, the pod was simply rescheduled to another node, so I'm wondering how you worked around this problem.
Many thanks!
The health-demo predates the nodeFit feature of the descheduler. You might want to turn nodeFit off in your descheduler configuration and see if that helps. Based on a quick look at the nodeFit implementation in the descheduler, it looks to me like it doesn't care about the configuration of the scheduler, and as a result it doesn't honor the fact that the resources in question are in fact configured as ignoredByScheduler.
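For reference, in the v1alpha1 policy the knob sits under the strategy params. Something roughly like this (a sketch only, I haven't tested it against your deployment; trim the strategy list to whatever you actually run):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeFit: false
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"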
If my quick analysis above is correct, then the proper place for this issue would be the descheduler project, which should in my opinion either honor the scheduler config somehow directly and automatically ignore those resources (this would be nice indeed), or at least allow configuring the same resources as ignoredByDescheduler.
As a workaround, if you need to keep the nodeFit feature enabled, you might resort to what I previously told you not to do: create the extended resource for the telemetry resource by hand (with curl) on the nodes. The scheduler should still be configured to ignore the resource, so it won't actually be consumed.
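The pattern is the same as advertising any extended resource in the Kubernetes documentation, just with the telemetry resource name. Something like the below, assuming kubectl proxy is running locally on port 8001; the node name and the value are only examples, and the ~1 is the JSON-Pointer escape for the / in the resource name. Repeat for each node:

# run "kubectl proxy" in another terminal first;
# pick a value comfortably larger than the sum of the pods' requests
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/telemetry~1scheduling", "value": "100"}]' \
  http://localhost:8001/api/v1/nodes/minikube-m02/status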
Another issue which you may or may not stumble into with the descheduler node_affinity strategy is that it will not deschedule anything unless there is another node ready where the pod could be scheduled. See descheduler issue #640. Turning nodeFit off won't help with that, unfortunately.
Filed with the descheduler project; we'll see what the solution will be: https://github.com/kubernetes-sigs/descheduler/issues/863
Thanks! I also noticed that the nodeFit flag is not taking effect, but the good news is that posting the extended resource to all worker nodes solves the problem for now. Still, it would be nice if the flag could be fixed later on.
That flag is particularly confusing, as witnessed by descheduler issue #845. Feel free to chime in; perhaps more people with the same expectations about the flag would make the maintainers reconsider whether turning the flag off should make some sort of a difference.
hazxel, thank you for pointing this out, and hopefully this can be solved within the descheduler project. Meanwhile, the workaround provided by uniemimu works well, as you commented above. Another option would be to use the previous descheduler version. Just a tip: in our case, when advertising the extended resource on all nodes, we used value: 111 for the extended resource; only then did the descheduler start to pay attention.
Hi togashidm, thanks for the tip. The value can actually be any positive integer, but it has to be big enough, right? Please correct me if I am wrong.
Yes, just big enough to get the effect.
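As far as I can tell, the fit check compares the pods' requests against what the node advertises, so "big enough" in practice means the capacity has to cover the sum of the telemetry/scheduling requests of the pods that would land on the node. A quick way to see what a node currently advertises and what its pods have requested (the node name is just an example):

# shows the extended resource under Capacity/Allocatable and in Allocated resources
kubectl describe node minikube-m02 | grep -i telemetry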