cluster-api ✨Simplify alternative control-plane implementation

What would you like to be added (User Story)?

As a developer, I’d like a better separation of concerns in control-plane provider implementations, and the ability to reuse the general-purpose logic that currently lives in the kubeadm provider implementation.

Detailed Description

The approaches currently provided for cluster bootstrap and control-plane management (kubeadm) include non-generic elements, so alternative control-plane providers (such as CAPBRKE2) have to copy parts of the original implementation.

An example of this is etcd member management, which is required to scale down control-plane nodes correctly while keeping the cluster healthy overall. It is marked as optional in the original proposal, but when etcd membership management or leader removal is absent, etcd fails (see the downstream issue for details).

Anything else you would like to add?

Examples of logic that provider code may want to reuse include:

etcd membership management
Pre-flight checks

Label(s) to be applied

/kind feature
/area bootstrap

Feb 26 '24 12:02 Danil-Grigorev

/triage needs-discussion

While I see value in reusing experiences across projects and possibly having some more help in KCP which is traditionally on the shoulders of one or two maintainers, I'm a little bit concerned about two angles of this discussion:

Exposing the internals of KCP as a public method will impact KCP's capability to react to issues or evolve.

For example, a fix like https://github.com/kubernetes-sigs/cluster-api/pull/10154 (the last KCP fix I recently reviewed) would probably require two minor releases to go through deprecation, while today we can easily include it in a patch release. Similarly, because of the guarantees on public method signatures, back-porting won't be possible anymore in most cases.

That's a lot to give up for a piece of code so critical in CAPI, unless we find a way to preserve flexibility and speed.

This looks like a major refactor

I don't have a clear idea of what is "general-purpose" and what is not; this makes it difficult for me to visualize how we should refactor the code to achieve the goal.

If I stick just to the example of "etcd member management", it is something that spans the whole codebase (scale up, scale down, upgrade, remediation, conditions, etc.). This would imply a major refactor across the entire KCP codebase.

That means that we probably need a more in-depth analysis and an incremental plan that allows us to manage the risk and impacts comfortably.

Let's wait for opinions from other core maintainers as well

Feb 26 '24 20:02 fabriziopandini

/priority backlog

Apr 11 '24 16:04 fabriziopandini

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.


Apr 18 '24 13:04 k8s-ci-robot