cockroach-operator

Operator goes into an infinite loop when trying to upgrade a single node cluster

Open udnay opened this issue 4 years ago • 5 comments

When running an e2e test with a 1-node cluster, the decommission logic always returns an error, which prevents the other actors from running to reconcile the cluster's state.

[487 / 488] Testing //e2e/upgrades:go_default_test; 121s darwin-sandbox
    logger.go:130: 2021-05-31T20:16:53.537Z     INFO    reconciling CockroachDB cluster {"CrdbCluster": "crdb-test-dk8jdq/crdb"}
    logger.go:130: 2021-05-31T20:16:53.537Z     INFO    Running action with index: 0 and  name: Decommission    {"CrdbCluster": "crdb-test-dk8jdq/crdb"}
    logger.go:130: 2021-05-31T20:16:53.537Z     WARN    check decommission oportunities {"action": "decommission", "CrdbCluster": "crdb-test-dk8jdq/crdb"}
    logger.go:130: 2021-05-31T20:16:53.537Z     ERROR   We cannot decommission if there are less than 3 nodes   {"action": "decommission", "CrdbCluster": "crdb-test-dk8jdq/crdb", "nodes": 1, "error": "decommission with less than 3 nodes is not supported", "errorVerbose": "decommission with less than 3 nodes is not supported\ngithub.com/cockroachdb/cockroach-operator/pkg/actor.decommission.Act\n\tpkg/actor/decommission.go:96\ngithub.com/cockroachdb/cockroach-operator/pkg/controller.(*ClusterReconciler).Reconcile\n\tpkg/controller/cluster_controller.go:130\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:297\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:252\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\texternal/io_k8s_sigs_controller_runtime/pkg/internal/controller/controller.go:215\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\texternal/io_k8s_apimachinery/pkg/util/wait/wait.go:99\nruntime.goexit\n\tGOROOT/src/runtime/asm_amd64.s:1371"}

This is printed to the logs repeatedly until the test times out. I think the issue is that we now always run decommission first, so the check at decommission.go#L95-L99 fires.

Reproduction Steps

  • Edit any test (the one I ran was TestUpgradesMinorVersion) and change the cluster to 1 node instead of 3 here
  • Run make test/e2e/kind-upgrades
  • The test will time out

/cc @keith-mcclellan @alinadonisa

udnay avatar May 31 '21 20:05 udnay

We didn't implement cockroach start-single-node, so I think we need to add some CR validation to make sure nodes >= 3 at all times until we can go back and do that work.

What's the lowest-drag way to do that, @chrislovecnm, without adding an admission controller and/or a webhook at the moment?
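One possible shape for such a check, sketched under assumptions: the `clusterSpec` type and its `Nodes` field are hypothetical stand-ins for the real CR spec, and `validate` is a hypothetical helper that the reconciler could call before running any actor. If the CRD is generated with controller-gen, a `+kubebuilder:validation:Minimum=3` marker on the field would add the same bound to the OpenAPI schema, which the API server enforces without a webhook.

```go
package main

import "fmt"

// minNodes is the smallest cluster size discussed above.
const minNodes = 3

// clusterSpec is a hypothetical, trimmed-down spec. The real CR type has
// more fields and may name this one differently.
type clusterSpec struct {
	// +kubebuilder:validation:Minimum=3
	Nodes int32 `json:"nodes"`
}

// validate rejects specs with too few nodes so the caller can surface a
// clear error instead of failing inside the decommission step.
func validate(spec clusterSpec) error {
	if spec.Nodes < minNodes {
		return fmt.Errorf("spec.nodes is %d, but at least %d nodes are required", spec.Nodes, minNodes)
	}
	return nil
}

func main() {
	for _, n := range []int32{1, 3} {
		if err := validate(clusterSpec{Nodes: n}); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", n, "nodes")
	}
}
```

The schema marker catches bad values when the object is created or updated, while the in-code check covers objects that already exist in the cluster.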

keith-mcclellan avatar Jun 01 '21 17:06 keith-mcclellan

@keith-mcclellan does the team have any plans to implement start-single-node for the operator?

camertron avatar Mar 30 '22 03:03 camertron

I would also be interested in this feature. It would be very useful for staging and feature deployments where one does not want to create an entire cluster for testing.

jhoelzel avatar Aug 30 '22 14:08 jhoelzel

I am also interested in this. I'd like to start using CockroachDB to allow for future scaling, but not every one of my applications requires a distributed system at the moment.

AdamJSoftware avatar Dec 28 '22 21:12 AdamJSoftware

Any news on this? It's been 2 years since this issue was opened.

AdamJSoftware avatar May 10 '23 13:05 AdamJSoftware