Dependency/ordering introduced over time may prevent disaster recovery
Issues
Configuration drift and missing validation of the dependency chain.
After moving to Flux with server-side apply (SSA), we need to restructure all of our configuration into a strict dependency tree. We expect this to cause problems when recreating clusters, due to configuration drift: as the configuration changes over time, a successful apply starts to depend on the existing cluster state.
We believe that with the current implementation there is a large risk that new dependency chains are introduced over time and only discovered during a disaster recovery, which would increase the time to recover and require refactoring the configuration in the middle of the recovery procedure.
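To make the risk concrete, below is a minimal sketch of the kind of chain we mean, using the Kustomization `dependsOn` field (names and paths are illustrative, and the apiVersion depends on the Flux release):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra
  namespace: flux-system
spec:
  interval: 10m
  path: ./infra
  prune: true
  sourceRef:
    kind: GitRepository
    name: cluster-config
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true
  # apps is only applied once infra is Ready. On a long-lived
  # cluster, a missing or wrong edge here can go unnoticed because
  # the objects it should wait for already exist; it only surfaces
  # when the cluster is rebuilt from scratch.
  dependsOn:
    - name: infra
  sourceRef:
    kind: GitRepository
    name: cluster-config
```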
Custom Resources with ordering
Currently 18 resources receive special-case ordering in Flux, but Kubernetes is an extensible API, and built-in resources should not get special handling that is unavailable to custom resources.
Namespace is one such resource; see https://github.com/fluxcd/pkg/blob/e693be5bc5f7759d08d0d0e09c02cd7882514e63/ssa/sort.go#L35-L60.
For instance, it is not possible to have a custom resource that creates namespaces in the same Kustomization as the resources that depend on those namespaces. This can be worked around with a second Flux Kustomization, but that adds a lot of configuration overhead.
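As an illustration, assume a hypothetical `NamespaceClaim` custom resource (the `example.org/v1` API is invented here) whose controller creates namespaces. The workaround costs one extra Flux Kustomization per ordering boundary:

```yaml
# The namespace-creating custom resources get their own Kustomization...
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-namespaces
  namespace: flux-system
spec:
  interval: 10m
  path: ./namespace-claims
  prune: true
  sourceRef:
    kind: GitRepository
    name: cluster-config
---
# ...purely so the workloads can depend on it, instead of the
# NamespaceClaims simply being sorted first within one Kustomization.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-workloads
  namespace: flux-system
spec:
  interval: 10m
  path: ./workloads
  prune: true
  dependsOn:
    - name: team-namespaces
  sourceRef:
    kind: GitRepository
    name: cluster-config
```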
The same issue was discussed for Flux v1 (https://github.com/fluxcd/flux/issues/2758).
Possible solutions
Sorting/ordering of CRDs
- Custom sorting parameters on a Flux Kustomization
- Sorting/priority labels on objects (sketched below)
- Sorting/priority labels on CRDs
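None of these options exist in Flux today. As a sketch of the second option, an apply-priority label on an object could look like the following, where the label key and its semantics are entirely hypothetical; the SSA sorter would read the value instead of consulting its hard-coded kind list:

```yaml
apiVersion: example.org/v1
kind: NamespaceClaim
metadata:
  name: team-a
  labels:
    # Hypothetical label: lower values are applied earlier,
    # analogous to Namespace's position in ssa/sort.go.
    ssa.fluxcd.io/apply-priority: "0"
```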
Eventual apply
With earlier versions of Flux, the configuration would eventually be applied: objects that failed were simply retried on the next sync. This removed the need to explicitly define a dependency chain, and we think it is more representative of how Kubernetes itself behaves.
We would like the option to toggle a Kustomization to tolerate failures and still apply as much as it can. Being able to switch between "abort-on-first-failure" and "continue-on-failure" would reduce the complexity of our configuration.
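As a sketch of what that toggle could look like on the Kustomization spec (the field name and values below are our invention, not an existing Flux API):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: crossplane-stack
  namespace: flux-system
spec:
  interval: 5m
  path: ./crossplane
  prune: true
  sourceRef:
    kind: GitRepository
    name: cluster-config
  # Hypothetical field: apply everything that can be applied,
  # record the failures, and retry them on the next
  # reconciliation instead of aborting on the first error.
  failureMode: ContinueOnFailure  # default: AbortOnFirstFailure
```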
As we use Crossplane, we end up with a complex chain of CRDs that create new CRDs. The current behavior forces us to maintain multiple Kustomizations containing a single resource each, just to model a dependency chain that would "just work" with eventual apply.