
Dynamic controller scaling

Open vincepri opened this issue 2 years ago • 12 comments

Currently we allow specifying a fixed number of workers for each controller.

After attending the talk at KubeCon on how to scale Cluster API to 2k clusters (link tba), it'd be good to allow controller-runtime to spin workers up and down dynamically, based on the number of objects in the queue and on the 90th percentile of overall reconciler duration.
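
To make the idea concrete, here is a rough, purely illustrative sketch (none of these names exist in controller-runtime today) of a worker pool that resizes itself from queue depth; a p90 reconcile-duration signal could feed into the same `resize` decision:

```go
// Illustrative only: nothing here is controller-runtime API.
package workerpool

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// dynamicPool runs between minWorkers and maxWorkers goroutines that
// drain a workqueue, resizing itself from the observed queue depth.
type dynamicPool struct {
	queue                  workqueue.Interface
	stopChans              []chan struct{}
	minWorkers, maxWorkers int
	itemsPerWorker         int // target queue depth each worker should absorb
	work                   func(item any)
}

func (p *dynamicPool) Run(stopCh <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	p.resize(p.minWorkers)
	for {
		select {
		case <-ticker.C:
			// Scale with queue depth, clamped to [minWorkers, maxWorkers].
			// A p90 reconcile-duration signal could be folded in here too.
			desired := p.queue.Len()/p.itemsPerWorker + 1
			desired = max(p.minWorkers, min(p.maxWorkers, desired))
			p.resize(desired)
		case <-stopCh:
			p.resize(0)
			return
		}
	}
}

func (p *dynamicPool) resize(n int) {
	for len(p.stopChans) < n { // scale up: start another worker
		stop := make(chan struct{})
		p.stopChans = append(p.stopChans, stop)
		go p.worker(stop)
	}
	for len(p.stopChans) > n { // scale down: signal the newest workers
		last := len(p.stopChans) - 1
		close(p.stopChans[last])
		p.stopChans = p.stopChans[:last]
	}
}

func (p *dynamicPool) worker(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		// Caveat: a worker blocked in Get() on an empty queue only sees
		// the stop signal after it processes one more item; a real
		// implementation would need a way to interrupt Get().
		item, shutdown := p.queue.Get()
		if shutdown {
			return
		}
		p.work(item)
		p.queue.Done(item)
	}
}
```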

vincepri avatar Nov 07 '23 22:11 vincepri

Can you describe it in more detail?

sqbi1024 avatar Nov 08 '23 18:11 sqbi1024

I think there are two tasks here:

  1. Change the reconciler's worker count at runtime.
  2. Implement a built-in backpressure/auto-scaling mechanism based on metrics [1] (see the sketch after this list).
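
For task 2, controller-runtime already exposes reconcile latency as a Prometheus histogram (`controller_runtime_reconcile_time_seconds`), so an external autoscaler could consume that. As an in-process alternative, here is a hedged sketch (all names hypothetical; only `reconcile.Reconciler` is real API) of a wrapper that keeps a sliding window of durations and reports a p90 for a scaler to act on:

```go
// Hypothetical sketch; only reconcile.Reconciler is existing API.
package scaling

import (
	"context"
	"sort"
	"sync"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// TimedReconciler wraps another reconciler and keeps a sliding window
// of observed reconcile durations for an autoscaler to query.
type TimedReconciler struct {
	Inner reconcile.Reconciler
	Size  int // window length, e.g. 500 samples

	mu     sync.Mutex
	window []time.Duration
}

func (t *TimedReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	start := time.Now()
	res, err := t.Inner.Reconcile(ctx, req)
	t.mu.Lock()
	t.window = append(t.window, time.Since(start))
	if len(t.window) > t.Size {
		t.window = t.window[1:]
	}
	t.mu.Unlock()
	return res, err
}

// P90 reports the 90th-percentile duration over the current window.
func (t *TimedReconciler) P90() time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.window) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), t.window...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[len(sorted)*9/10]
}
```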

halfcrazy avatar Nov 13 '23 03:11 halfcrazy

/kind feature

troy0820 avatar Nov 13 '23 14:11 troy0820

I'm skeptical whether changing the number of workers during runtime is a good idea. I always considered the number of workers to be the "size" of a controller – somewhat related to resource requests/limits. Increasing the number of workers without increasing its requests/limits might cause the process to be throttled, i.e., it might not help in increasing the controller's capacity/throughput.

Instead, I suggest looking into horizontally scaling controllers including some form of sharding. I explored the idea in this project: https://github.com/timebertt/kubernetes-controller-sharding

timebertt avatar Nov 27 '23 08:11 timebertt

I'm skeptical whether changing the number of workers during runtime is a good idea.

Like any other change we usually propose, this would be opt-in.

Instead, I suggest looking into horizontally scaling controllers including some form of sharding.

Controller Runtime is focused on a single controller scenario acting as a leader for the time being; but this is probably good to document outside of this project.

vincepri avatar Nov 27 '23 17:11 vincepri

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 25 '24 17:02 k8s-triage-robot

/lifecycle frozen

vincepri avatar Feb 26 '24 22:02 vincepri

Hi @vincepri,

We're seeing similar issues with Spark Operator: event spikes overwhelm the controller, causing high latencies and timeouts for our time-sensitive batch workloads.

Are you proposing dynamically adjusting MaxConcurrentReconciles based on queue depth and reconciliation latency, or modifying controller thread scaling more broadly? I would love to understand the approach and potentially contribute in this area.

shubhM13 avatar Mar 04 '25 01:03 shubhM13

I would just use #2374

sbueringer avatar Mar 04 '25 06:03 sbueringer

Thanks @sbueringer - to use that feature, is it just a boolean to set while initializing the controller, or are we also expected to define priority levels and handle priority assignment for events?

shubhM13 avatar Apr 20 '25 05:04 shubhM13

@shubhM13 you only have to enable the feature as described in the issue description here: https://github.com/kubernetes-sigs/controller-runtime/issues/2374
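
For reference, with recent controller-runtime versions my understanding is that opting in looks roughly like the snippet below. Treat the exact option shape (`UsePriorityQueue` on `pkg/config.Controller`) as an assumption on my part and double-check it against the linked issue and your controller-runtime version:

```go
package main

import (
	"os"

	"k8s.io/utils/ptr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/config"
)

func main() {
	// Opt all controllers built through this manager into the priority
	// queue (see #2374); no priority levels need to be defined by hand.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Controller: config.Controller{
			UsePriorityQueue: ptr.To(true),
		},
	})
	if err != nil {
		os.Exit(1)
	}
	// Wire up controllers as usual, then run the manager.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```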

sbueringer avatar May 08 '25 05:05 sbueringer

Hi, I am looking for a solution for scaling a controller that reconciles 2k+ CRD objects, and I learned about timebertt/kubernetes-controller-sharding at KubeCon Japan.

I was wondering whether that feature could be implemented in controller-runtime, which is how I reached this issue.

It's just an idea, but if controller-runtime had an option for a sharding mode that could be enabled automatically when the leader-election lock logic detects multiple controller replicas, I think this would help both controller developers and users, because we could then simply configure a HorizontalPodAutoscaler for the controller.

Controller Runtime is focused on a single controller scenario acting as a leader for the time being; but this is probably good to document outside of this project.

I have not dug into the details, so I don't know whether this fits the project's philosophy or how high the hurdle would be, but is there a possibility of considering it in the future?
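
To illustrate the idea (this is not an existing controller-runtime feature, just a sketch): each replica could filter its watch events to "its" objects with a predicate, e.g. by hashing namespace/name modulo the replica count, with the shard index derived from something like a StatefulSet pod ordinal:

```go
// Sketch only: sharding is not a controller-runtime feature today.
package sharding

import (
	"hash/fnv"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// ShardPredicate drops events for objects that belong to other shards.
// shardIndex and totalShards would come from the environment, e.g.
// derived from a StatefulSet pod ordinal, so scaling the replicas
// re-partitions the keyspace.
func ShardPredicate(shardIndex, totalShards uint32) predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		h := fnv.New32a()
		h.Write([]byte(obj.GetNamespace() + "/" + obj.GetName()))
		return h.Sum32()%totalShards == shardIndex
	})
}
```

This would be wired in with `.WithEventFilter(sharding.ShardPredicate(idx, total))` on the builder. Note the limitations: plain modulo hashing reshuffles most assignments whenever the replica count changes, and every replica still watches and caches all objects; if I understand it correctly, the kubernetes-controller-sharding project linked above addresses both with consistent hashing and lease-based shard membership.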

jlandowner avatar Jun 26 '25 00:06 jlandowner