Operator goes into an infinite loop when trying to upgrade a single-node cluster
When running an e2e test with just a 1-node cluster, the decommission logic always returns an error, preventing the other actors from reconciling the cluster's state.
[487 / 488] Testing //e2e/upgrades:go_default_test; 121s darwin-sandbox
logger.go:130: 2021-05-31T20:16:53.537Z INFO reconciling CockroachDB cluster {"CrdbCluster": "crdb-test-dk8jdq/crdb"}
logger.go:130: 2021-05-31T20:16:53.537Z INFO Running action with index: 0 and name: Decommission {"CrdbCluster": "crdb-test-dk8jdq/crdb"}
logger.go:130: 2021-05-31T20:16:53.537Z WARN check decommission oportunities {"action": "decommission", "CrdbCluster": "crdb-test-dk8jdq/crdb"}
logger.go:130: 2021-05-31T20:16:53.537Z ERROR We cannot decommission if there are less than 3 nodes {"action": "decommission", "CrdbCluster": "crdb-test-dk8jdq/crdb", "nodes": 1, "error": "decommission with less than 3 nodes is not supported", "errorVerbose": "decommission with less than 3 nodes is not supported\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:96\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}
This is what is printed to the logs until the test times out. I think the issue is that we now always run decommission first, and the check at decommission.go#L95-L99 fires.
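For context, here is a minimal sketch of the loop's shape. This is not the operator's actual code; the names are mine, modeled on the log output above and the guard referenced at decommission.go#L95-L99. Every reconcile pass hits the under-3-nodes check, the error bubbles up to the reconciler, the request is re-queued, and the next pass fails the same way.

```go
// Sketch only: illustrates why a 1-node cluster can never make progress when
// a "fewer than 3 nodes" guard runs on every reconcile pass.
package main

import (
	"errors"
	"fmt"
)

var errTooFewNodes = errors.New("decommission with less than 3 nodes is not supported")

// checkDecommission mirrors the shape of the guard: anything under 3 nodes
// is rejected outright, before any other actor gets a chance to run.
func checkDecommission(nodes int) error {
	if nodes < 3 {
		return errTooFewNodes
	}
	return nil
}

func main() {
	// With a 1-node cluster every attempt fails identically, which is the
	// endless stream of ERROR lines seen in the e2e test logs.
	for attempt := 1; attempt <= 3; attempt++ {
		if err := checkDecommission(1); err != nil {
			fmt.Printf("reconcile attempt %d: %v (re-queued)\n", attempt, err)
		}
	}
}
```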
Reproduction Steps
- Edit any test (the one I ran was TestUpgradesMinorVersion) and change the cluster to be 1 node instead of 3 here
- Run `make test/e2e/kind-upgrades`
- The test will time out

/cc @keith-mcclellan @alinadonisa
We didn't implement `cockroach start-single-node`, so I think we need to add some CR validation to make sure nodes >= 3 at all times until we can go back and do that work.
What's the lowest-drag way to do that, @chrislovecnm, without adding an admission controller and/or a webhook at the moment?
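One low-drag option, sketched below under the assumption that the spec type and field roughly match the operator's `CrdbClusterSpec`, is a kubebuilder validation marker on the node-count field. controller-gen turns the marker into OpenAPI validation in the generated CRD, so the API server itself rejects specs with fewer than 3 nodes and no admission webhook is needed.

```go
// Sketch only: the type and field names are assumptions modeled on the
// operator's API package; the +kubebuilder:validation:Minimum marker is the
// standard controller-gen mechanism for schema-level numeric validation.
package v1alpha1

// CrdbClusterSpec defines the desired state of a CrdbCluster.
type CrdbClusterSpec struct {
	// Nodes is the number of CockroachDB pods in the cluster.
	// +kubebuilder:validation:Minimum=3
	Nodes int32 `json:"nodes"`
}
```

The tradeoff is that this blocks single-node clusters entirely, so the marker would have to be relaxed again if `start-single-node` support lands later.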
@keith-mcclellan does the team have any plans to implement start-single-node for the operator?
I would also be interested in this feature. It would be very useful for staging and feature deployments where one does not want to create an entire cluster just for testing.
I am also interested in this, as I'd like to start using CockroachDB to allow for future scaling, but not every single one of my applications requires a distributed system at the moment.
Any news on this? It's been 2 years since this issue was opened.