Working with descheduler
Hi there,
I'm using the descheduler to reschedule a pod to another node. However, the descheduler complains that every node has insufficient resource "telemetry/scheduling", which prevents the pod from being rescheduled. (I've checked the source code of the descheduler, and it only evicts pods that don't fit their current node but do fit some other node; see the code below from node_affinity.go)
pods, err := podutil.ListPodsOnANode(
    node.Name,
    getPodsAssignedToNode,
    podutil.WrapFilterFuncs(podFilter, func(pod *v1.Pod) bool {
        // Only pods that are evictable, do NOT fit their current node,
        // but DO fit at least one other node, pass the filter.
        return evictorFilter.Filter(pod) &&
            !nodeutil.PodFitsCurrentNode(getPodsAssignedToNode, pod, node) &&
            nodeutil.PodFitsAnyNode(getPodsAssignedToNode, pod, nodes)
    }),
)
if err != nil {
    klog.ErrorS(err, "Failed to get pods", "node", klog.KObj(node))
}
for _, pod := range pods {
    if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
        klog.V(1).InfoS("Evicting pod", "pod", klog.KObj(pod))
        if _, err := podEvictor.EvictPod(ctx, pod, node, "NodeAffinity"); err != nil {
            klog.ErrorS(err, "Error evicting pod")
            break
        }
    }
}
I'm using the setup from the health-metric-demo. The descheduler logs look like this:
I0628 14:32:11.838531 72888 node_affinity.go:78] "Processing node" node="minikube-m02"
I0628 14:32:11.838554 72888 node.go:183] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m02"
I0628 14:32:11.838557 72888 node.go:185] "insufficient telemetry/scheduling"
I0628 14:32:11.838568 72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube"
I0628 14:32:11.838571 72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.838579 72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m02"
I0628 14:32:11.838582 72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.838591 72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m03"
I0628 14:32:11.838619 72888 node.go:168] "pod node selector does not match the node label"
I0628 14:32:11.838624 72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.839395 72888 descheduler.go:312] "Number of evicted pods" totalEvicted=0
In your instructions for the health-demo, the pod was simply rescheduled to another node, so I'm wondering how you worked around this problem.
Many thanks!
The health-demo predates the nodeFit feature of the descheduler. You might want to turn nodeFit off in your descheduler configuration and see if that helps. Based on a quick look at the nodeFit implementation in the descheduler, it looks to me like it doesn't care about the configuration of the scheduler, and as a result it doesn't honor the fact that the resources in question are in fact configured as ignoredByScheduler.
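For reference, in the v1alpha1 policy the knob sits under the strategy params. Something roughly like this (a sketch only, I haven't tested it against your deployment; trim the strategy list to whatever you actually run):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeFit: false
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"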
If my quick analysis above is correct, then the proper place for this issue would be the descheduler project, which should in my opinion either honor the scheduler config somehow directly and automatically ignore those resources (this would be nice indeed), or at least allow configuring the same resources as ignoredByDescheduler.
As a workaround, if you need to keep the nodeFit feature enabled, you might resort to what I previously told you not to do: create the extended resource for the telemetry resource by hand (with curl) on the nodes. The scheduler should still be configured to ignore the resource, so it won't actually be consumed.
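The pattern is the same as advertising any extended resource in the Kubernetes documentation, just with the telemetry resource name. Something like the below, assuming kubectl proxy is running locally on port 8001; the node name and the value are only examples, and the ~1 is the JSON-Pointer escape for the / in the resource name. Repeat for each node:

# run "kubectl proxy" in another terminal first;
# pick a value comfortably larger than the sum of the pods' requests
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/telemetry~1scheduling", "value": "100"}]' \
  http://localhost:8001/api/v1/nodes/minikube-m02/status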
Another issue which you may or may not stumble into with the descheduler node_affinity strategy is that it will not deschedule anything unless there is another node ready where the pod could be scheduled. See descheduler issue #640. Turning nodeFit off won't help with that, unfortunately.
Filed with the descheduler project; we'll see what the solution will be: https://github.com/kubernetes-sigs/descheduler/issues/863
Thanks! I also noticed that the nodeFit flag is not taking effect, but the good news is that posting the extended resource to all worker nodes solves the problem for now. Still, it would be nice if the flag could be fixed later on.
That flag is particularly confusing, as witnessed by descheduler issue #845. Feel free to chime in; perhaps more people with the same expectations about the flag would make the maintainers reconsider whether turning the flag off should make some sort of a difference.
hazxel, thank you for pointing this out, and hopefully this can be solved within the descheduler project. Meanwhile, the workaround provided by uniemimu works well, as you commented above. Another option would be to use the previous descheduler version. Just a tip: in our case, when advertising the extended resource on all nodes, we used value: 111 for the extended resource; only then did the descheduler start to pay attention.
Hi togashidm, thanks for the tip. The value can actually be any positive integer, but it has to be big enough, right? Please correct me if I am wrong.
Yes, just big enough to get the effect.
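As far as I can tell, the fit check compares the pods' requests against what the node advertises, so "big enough" in practice means the capacity has to cover the sum of the telemetry/scheduling requests of the pods that would land on the node. A quick way to see what a node currently advertises and what its pods have requested (the node name is just an example):

# shows the extended resource under Capacity/Allocatable and in Allocated resources
kubectl describe node minikube-m02 | grep -i telemetry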