proposals Decouple CA and end-entity certificate management

This proposal describes how to create better separation between managing the self-signed CA that Strimzi owns and end-entity certificates. It is based on PR #46 which was closed. It proposes a smaller change but still working towards the wider goal of making it possible to use a tool like Cert Manager in Strimzi

Apr 24 '24 10:04 katheris

Have you thought much about how this interacts with https://github.com/strimzi/proposals/pull/117

Apr 30 '24 23:04 tombentley

@tombentley Thanks for the review, I think I've addressed most of your comments. Take a look and see if you are happy for me to change this to a full PR.

On PR #117 although it's a similar space I don't think they affect each other much, since that one is more about how the user specifies the certificates should be given to Strimzi, whereas this is more focused on internals.

May 01 '24 15:05 katheris

Hi @scholzj , thanks for the review. In general looking at your comments I have definitely skipped over how this works when users bring their own CA, so I'll definitely update to include that and see if I can make the proposal clearer in general. On the specific things you called out..

You talk about CertManager or Vault as motivation, but the proposal does not relate to them in any way. For Cert Manager, I would expect some detailed API changes and some descriptions of the user workflows.

This change doesn't directly include API changes for using CertManager, however the changes are designed to make it easier to bring that in. I chose not to create a proposal that describes the complete picture because I think that makes it harder to reach agreement (as there are a lot of elements to discuss) and if we design everything upfront then we cannot account for the unknowns that we would discover along the way. Instead I have tried to identify a change that is useful in understanding the state of a cluster better, but also sets us up for the CertManager integration later. If you think it would be better to describe everything in one big proposal, or open multiple proposals at once to cover the whole thing I can, but I'm aware that the previous big proposal never got consensus, even though it was open for years.

The single root CA thing ... I'm not sure it is as presented in the motivation. I'm not aware of any requirement for using an arbitrary number of Cluster CAs. The logic is simple -> you always want to use a single CA and you want to allow transitions between one CA to another CA. That is supported today.

If you are using Cert Manager then you would request each certificate individually, and I don't think you can guarantee that each one is signed by the same CA, since that process is handled by Cert Manager.

The actual proposal seems to outline a kind of a state machine for CA renewals or replacement as a kind of internal refactoring I guess? If that is the goal of this proposal, that is fine. But:

I think it lacks the depth about how it ensures the reliability of the process.

It lacks the explanation how existing clusters with Strimzi / Custom CAs will be handled

It lacks explanation how CA or server cert renewal will be handled (versus CA replacement)

One has to also wonder if the external integration with Cert Manager should be clarified to make sure it fits the new model.

Some of the other things worth clarifying ...

Impact on the listener certificates? (Maybe there is none, but worth mentioning how that fits in)

I'll work on some changes to address these better

May 14 '24 09:05 katheris

You talk about CertManager or Vault as motivation, but the proposal does not relate to them in any way. For Cert Manager, I would expect some detailed API changes and some descriptions of the user workflows.

This change doesn't directly include API changes for using CertManager, however the changes are designed to make it easier to bring that in. I chose not to create a proposal that describes the complete picture because I think that makes it harder to reach agreement (as there are a lot of elements to discuss) and if we design everything upfront then we cannot account for the unknowns that we would discover along the way. Instead I have tried to identify a change that is useful in understanding the state of a cluster better, but also sets us up for the CertManager integration later. If you think it would be better to describe everything in one big proposal, or open multiple proposals at once to cover the whole thing I can, but I'm aware that the previous big proposal never got consensus, even though it was open for years.

I do not think you need one big proposal for everything. But I think it is important to link things properly and explain the motivation. If you say in the proposal that CertManager integration is one of the motivations, it is I think important to understand how it relates to it and to be honest, I do not understand that. And in this case, it seems important part of the value of the changes proposed here.

My expectation is that the CertManager integration will work the way that the user points Strimzi to Cert Manager Issuer/ClusterIssue and Strimzi will create the Certificate resources to get certificates from that issuer. I might not be an expert on this, but I do not see that producing an arbitrary number of CAs and having every certificate signed by a different CA. There will be just renewals / key replacements to handle as we have today and as are handled by the current mechanism.

I might be wrong with how it works or I might be missing some important aspects. But the main point I'm trying to make is that the proposal seems to take it as a well known fact that this is needed without explaining why. And at least for me, it is not clear and should be explained.

The single root CA thing ... I'm not sure it is as presented in the motivation. I'm not aware of any requirement for using an arbitrary number of Cluster CAs. The logic is simple -> you always want to use a single CA and you want to allow transitions between one CA to another CA. That is supported today.

If you are using Cert Manager then you would request each certificate individually, and I don't think you can guarantee that each one is signed by the same CA, since that process is handled by Cert Manager.

You do request them individually. But you request them from an issuer. And my understanding is that the issue defines the CA. Again, maybe this just needs more details because it is not clear what made you make the jump from usign CertManager to the need for an arbitrary number of CAs.

May 14 '24 16:05 scholzj

@tombentley @scholzj @tinaselenge thanks for your reviews. I've made some updates, take a look and see what you think. The fundamental things that have changed are:

I've clarified the roles of the different reconciler classes and when exactly the state transitions happen
I've updated to have the new shared secret be the one that is directly used for trust
I've added more background of the current flow as I understand it.

The things I don't think I've addressed yet are:

The specific flow for the User Operator
The flow when a user specifies their own listener certificates. For both of these I'm wondering whether we should actually restrict this proposal to only affect the cluster CA and leave the clients CA as is, let me know what you think.

@ppatierno once you're back if you could also take a look that would be great. I'd be particularly interested in you reviewing the current situation section as I think you implemented the logic around the generation annotations and the rollout of trust when the CA key changes.

May 20 '24 12:05 katheris

The PR proposes changes to the handling of CA certs to facilitate the integration of tools such as cert manager. I think it would be useful to consider at least at a high level how the overall solution will work before getting into the details. The details in the PR make some assumptions about the wanted behaviour and is perhaps also influenced by the existing solution which needs to take care of all considerations due to the desire to be secure out of box. In refactoring the solution I think it is good to weigh up the pros and cons of the possible approaches and make conscious decisions on the direction to take which will then feed into the details.

Where should the CA certs for trust be sourced? Options:
- From the cert manager response
  - Cert manager returns from the request to create a cert, both the cert itself and the CA that can be used to establish trust.
  - The cert manager documentation states this is a best effort and is not necessarily correct. I'm not familiar with the details but I guess its possibly dependent on the issuer.
  - Do other alternatives to cert manager have similar behaviour, or will this be a problem for using other solutions
- Configuration
  - The CA can be provided by the user, similar to is already done today when using an "own CA cert"
Where should responsibility lie for ensuring, on an ongoing basis, the CA certs used to establish trust are up to date? Options:
- Strimzi
  - In the existing code, this responsibility lies with Strimzi for the cluster operator generated CA. The PR seems to aim for this responsibility lying in Strmzi for the external CAs from cert manager. While it is required for this to be handled by Strimzi for the cluster operator generated CA (as we want secure out of the box) it is not necessarily required to be done by Strimzi in other scenarios
- User
  - In the existing code, this responsibility lies with the user for the 'own CA' option. I believe it should at least be considered that it is handled as per the 'own CA' option which would eliminate the need for this new secret and state machine, particular when you consider root certs are usually long lived.
A more general question on the overall solution rather than this PR specifically: How should we integrate with cert- manager? Options:
- Integrate with cert-manager itself
  - Create the custom resources programmatically, through an implementation of a new interface to facilitate alternatives to cert manager
  - This is what is proposed in the older referenced PR
- Integrate with the outputs of cert manager (i.e. the generated certs)
  - Use the secrets/certs generated by cert manager rather than interacting with cert manager itself
  - Already supported by Strimzi for TLS listener certs through the brokerCertChainAndKey config option https://strimzi.io/blog/2021/05/07/deploying-kafka-with-lets-encrypt-certificates/
  - I think this alternative approach should at least be considered

May 24 '24 16:05 MichaelMorrisEst

@scholzj @ppatierno @PaulRMellor @tinaselenge @fvaleri Thanks for all your reviews, I've recently pushed some updates to hopefully address a lot of the comments raised. The main changes are:

Introduce two new Kubernetes Secrets rather than 1 to provide better separation and adjust proposal accordingly/
Apply wording suggestions.
Add more detail to current situation section to make it clearer how things work today.
Highlight new steps in the flow in bold to highlight where things will differ from today.

When you get time I'd appreciate you taking another look at the proposal

Jun 10 '24 10:06 katheris

Thanks @katheris, I'll try to do another more detailed review in the next few days.

Thanks @tombentley, I've addressed your initial comments

Jun 10 '24 10:06 katheris

Hi @MichaelMorrisEst, thanks for your comments, sorry it's taken me a bit of time to get to replying. I've recently pushed some changes to the proposal so feel free to take another read.

I agree that I have made some assumptions about how it would work and I'll address the specifics below, but generally my aim is that this proposal starts making it easier for us to refactor things internally. Some of the assumptions I have made are based on community discussions on this topic and debate on the original proposal https://github.com/strimzi/proposals/pull/46, which despite never being accepted did have quite a bit of discussion.

I have avoided opening a proposal about the overall approach on integrating tools like cert-manager, because in the past when we did that (see PR 46) it was very hard to get consensus, since it is such a complicated thing to implement. The hope is that this proposal describes something that is clear how we would implement it today and is a first step towards supporting those other tools. This should allow us to make progress on this feature without having to do all the design upfront, then having to redesign it when we come across the unknown unknowns that will surely crop up.

In answer to your specific points:

Where should the CA certs for trust be sourced?

I believe the updated proposal no longer makes an assumption on the answer to this question.

Where should responsibility lie for ensuring, on an ongoing basis, the CA certs used to establish trust are up to date?

In the existing code, when the user has chosen the 'own CA' option, Strimzi is still responsible for rolling the different components to trust the new CA certificate when the private key is replaced. That is still the case in this proposal since even if root certs are usually long lived, we do need to correctly handle replacements/renewals.

A more general question on the overall solution rather than this PR specifically: How should we integrate with cert- manager?

This is one where I am making an assumption, which is that Strimzi is responsible for requesting certificates it needs. The reason for this is that whenever anything changes in the cluster, e.g. scaling number of nodes, adding or removing listeners, enabling components like the topic operator, it is possible a new certificate will be required. Requiring user action (creating a certificate) before making any of these changes would make for a poor user experience in my opinion. It would also effectively make the communication between components a user facing implementation detail, making it harder to change in future. For these reasons I personally feel it's a reasonable assumption.

Jun 10 '24 10:06 katheris

Hi @katheris This is a step in the right direction from IBM's (Event Stream's) perspective. I have added a couple of review comments to your proposal.

Jun 12 '24 09:06 TJ5303

@tombentley @scholzj @ppatierno @fvaleri @tinaselenge thanks for your review comments. When you have time it would be great if you could take another look to see if we can get this closer to approval.

Jul 03 '24 08:07 katheris

Leaving aside the motivation, I think we really need to better understand from the proposal:

What will be the format of the trust state and how will it be used during rolling, in particular to allow to continue rolling updates after they were interupted
What will be the upgrade / downgrade

Jul 18 '24 08:07 scholzj

Leaving aside the motivation, I think we really need to better understand from the proposal:

What will be the format of the trust state and how will it be used during rolling, in particular to allow to continue rolling updates after they were interupted

What will be the upgrade / downgrade

@scholzj thanks for the review, I've addressed a lot of the comments. I'll work on adding some more example walkthroughs to illustrate how it would handle downgrade and rolling that takes into account interruptions. Are there any particular interruptions you see as most likely, or just in general at each step understanding what would happen if the operator pod was bounced?

Jul 22 '24 11:07 katheris

Leaving aside the motivation, I think we really need to better understand from the proposal:

What will be the format of the trust state and how will it be used during rolling, in particular to allow to continue rolling updates after they were interupted

What will be the upgrade / downgrade

@scholzj thanks for the review, I've addressed a lot of the comments. I'll work on adding some more example walkthroughs to illustrate how it would handle downgrade and rolling that takes into account interruptions. Are there any particular interruptions you see as most likely, or just in general at each step understanding what would happen if the operator pod was bounced?

The operator can be killed at any point in time of the process. Similarly, the reconciliation might fail at many points of the process - especially the rolling updates are critical. So it needs to be able to recover from these situation and ideally (as in really really required 😉) not start from the beginning again.

Jul 22 '24 12:07 scholzj

I've pushed a couple of changes to this proposal, so please do re-review when you have the time. The key changes are:

Updating the states to change TRUSTED_IN_USE state to TRUSTED_IN_USE_ANY and adding state TRUSTED_IN_USE_ALL.
Expanding flow examples to be more explicit and adding specific section that describes state changes.
Updating diagrams to include more details and adding new state diagram.
Updating upgrade flow and adding downgrade flow.

Aug 15 '24 16:08 katheris

@scholzj I think these changes should address most of your previous comments.

Aug 15 '24 16:08 katheris

But the main problem is that the Proposal section does not seem to address any single point from the Motivation. The only benefit I can see is that we probably safe one list pods request and replace it with one get secret call. But that is not something we struggle with nor something what would justify this change.

@scholzj thanks for your review, on the above point, in the motivation in the proposal for completeness I listed all the things that will need changing in the way Strimzi works in order to integrate with an external certificate management tool. This proposal focuses on the parts that are related to how we trust the CA certificate, and does not address issuing certificates. I included a "Future changes" section listing future to explain how I would expect that to be addressed, but obviously a separate proposal will be needed to elaborate on those points and get agreement.

So to be clear on how I think the proposal does/doesn't address the motivation, from the proposal:

Addressed by the proposal

Strimzi cannot guarantee that the issuer will issue every certificate signed by the same root CA certificate.

The new trusted-certs Kubernetes Secret allows us to store and identify multiple certificates, whereas the existing cluster-ca-cert certificate stores only two - the current one called ca.crt, and the previous one called ca-YYYY-MM-DDTHH-MM-SSZ.crt

The state of certificates, particularly CA certificates is determined based on multiple different annotations and fields on internal Java objects. This makes it hard to comprehend the current state, particularly when doing a renewal and tricky to reason about edge cases that might occur.

This proposal addresses this by adding in the new trusted-certs Secret.

Partially addressed by the proposal

These solutions cannot be responsible for rolling updates to components like Kafka.

This means that end-entity certificate generation and the rollout of trust in the CA certificates referenced in those end-entity certificates, needs to be decoupled.

The proposal starts to address this by no longer relying on the ca-key-generation annotation. Currently the component reconcilers handle that annotation, which doesn't make sense if we want to have the CA reconciler focus on the role out of trust and the component reconcilers focus on requesting end-entity certificates.

These solutions might take a varying amount of time to issue a certificate.

Since we do not know how long certificate issuance might take, we can't block the reconcile loop waiting for certificates to be issued.

This means it can't be assumed that e.g. the CA certificate is already present when the component reconcilers run (since it not be issued yet).

Similar to the previous point this proposal starts to address this by no longer relying on the component reconcilers completing their reconcile actions to know if the pods trust the CA. However this proposal doesn't change the way certificate issuance is handled. I listed that under "Future changes".

Not addressed by the proposal

The component reconcilers directly interact with the Java class that represents the Cluster CA. For example, this class contains methods not only to manage the CA certificate and private key, but also generate certificates for Strimzi components (like Cruise Control) and store state about renewals that happened during the current reconcile loop. This makes it hard to drop in an alternative to the existing Strimzi Cluster CA, since it's behaviour is relied upon throughout the reconcile loop.

This is not addressed by the proposal, since it is related to certificate issuance.

Sep 12 '24 09:09 katheris