Keith
Hi @pornpoi, this actually appears to be a database bug and not an operator bug (but that's not to say we can't do something in the operator to prevent this...)
We discussed internally and we think the issue is that the default start pattern is one at a time, rather than all at once. We're looking into changing the default...
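If the "start pattern" here maps to the StatefulSet `podManagementPolicy` (an assumption on my part), changing the default would look roughly like this in the operator's Go code; the builder function is illustrative, not the operator's actual one:

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newStatefulSet is a hypothetical builder showing where the knob lives.
func newStatefulSet(name, namespace string, replicas int32) *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: appsv1.StatefulSetSpec{
			Replicas: &replicas,
			// Parallel launches/terminates pods all at once during scale
			// operations, instead of the default OrderedReady (one at a time).
			PodManagementPolicy: appsv1.ParallelPodManagement,
		},
	}
}
```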
my diagnosis is that after we upgrade the first partition, we're trying to update a stale version of the StatefulSet definition rather than pulling the latest before updating the next...
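A minimal sketch of the fix that diagnosis implies, assuming client-go: re-fetch the latest StatefulSet inside the update loop instead of reusing the stale copy, and retry on conflict. `updatePartition` and its arguments are illustrative:

```go
package operator

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func updatePartition(ctx context.Context, cs kubernetes.Interface, ns, name string, partition int32) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Pull the latest definition on every attempt; updating an old copy
		// is exactly what strands the rollout after the first partition.
		ss, err := cs.AppsV1().StatefulSets(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if ss.Spec.UpdateStrategy.RollingUpdate == nil {
			ss.Spec.UpdateStrategy.RollingUpdate = &appsv1.RollingUpdateStatefulSetStrategy{}
		}
		ss.Spec.UpdateStrategy.RollingUpdate.Partition = &partition
		_, err = cs.AppsV1().StatefulSets(ns).Update(ctx, ss, metav1.UpdateOptions{})
		return err
	})
}
```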
the PVC gets deleted as well, so there is no way to recover from this state.
Decommission should run as follows, functionally:

1) Validate that the node count after decommission is still >= 3 (node decommissions CAN be run in parallel)
2) Run `cockroach node decommission`...
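Rough Go sketch of steps 1 and 2, shelling out to the CLI (the helper and its arguments are mine, not operator code):

```go
package operator

import (
	"fmt"
	"os/exec"
	"strconv"
)

const minNodes = 3

func decommissionNodes(currentNodes int, nodeIDs []int, certsDir, host string) error {
	// Step 1: refuse any decommission that would drop the cluster below 3 nodes.
	if remaining := currentNodes - len(nodeIDs); remaining < minNodes {
		return fmt.Errorf("decommissioning %d node(s) would leave %d; minimum is %d",
			len(nodeIDs), remaining, minNodes)
	}
	// Step 2: decommissions CAN run in parallel, so pass all IDs to one invocation.
	args := []string{"node", "decommission"}
	for _, id := range nodeIDs {
		args = append(args, strconv.Itoa(id))
	}
	args = append(args, "--certs-dir="+certsDir, "--host="+host)
	return exec.Command("cockroach", args...).Run()
}
```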
ref: https://www.cockroachlabs.com/docs/v21.1/cockroach-node.html
ref 2: https://www.cockroachlabs.com/docs/v21.1/cockroach-node.html#flags
ref 3: https://www.cockroachlabs.com/docs/v21.1/remove-nodes.html
> > Positive case 1 -
> > After cockroach node decommission is run, health-checker should show 0 under-replicated ranges AND cockroach node status --decommission should show the node as...
`cockroach node status --decommission --certs-dir=certs --host=`

```
 id |        address         |  build  |            started_at            |            updated_at            | is_available | is_live | gossiped_replicas | is_decommissioning | is_draining
+---+------------------------+---------+----------------------------------+----------------------------------+--------------+---------+-------------------+--------------------+-------------+
  1...
```
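To automate the check quoted above, something like this could parse `cockroach node status --decommission` with `--format=csv` and confirm the node shows `is_decommissioning=true` with 0 gossiped replicas (sketch only; column positions follow the table above):

```go
package operator

import (
	"encoding/csv"
	"fmt"
	"os/exec"
	"strings"
)

// nodeFullyDecommissioned reports whether the given node has finished
// decommissioning according to `cockroach node status --decommission`.
func nodeFullyDecommissioned(nodeID, certsDir, host string) (bool, error) {
	out, err := exec.Command("cockroach", "node", "status", "--decommission",
		"--format=csv", "--certs-dir="+certsDir, "--host="+host).Output()
	if err != nil {
		return false, err
	}
	rows, err := csv.NewReader(strings.NewReader(string(out))).ReadAll()
	if err != nil {
		return false, err
	}
	if len(rows) < 2 {
		return false, fmt.Errorf("no node rows in status output")
	}
	for _, row := range rows[1:] { // row 0 is the header
		if row[0] != nodeID {
			continue
		}
		// Columns, per the table above: ... gossiped_replicas (7),
		// is_decommissioning (8), is_draining (9).
		return row[7] == "0" && row[8] == "true", nil
	}
	return false, fmt.Errorf("node %s not found in status output", nodeID)
}
```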
I'm not questioning that the cc drainer works properly, I'm questioning whether we implemented it properly. Something is stopping the pod before the decommission is complete... see:

```
{"level":"warn","ts":1622218688.9847136,"logger":"action","msg":"reconciling resources on...
```
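If the pod really is being stopped early, one guard (again just a sketch) is to block deletion until the decommission has fully completed, e.g. by polling the `nodeFullyDecommissioned` check above under a deadline:

```go
package operator

import (
	"context"
	"fmt"
	"time"
)

// waitForDecommission polls until the node is fully decommissioned or the
// context deadline expires; only then should the operator stop the pod.
func waitForDecommission(ctx context.Context, nodeID, certsDir, host string) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("node %s still decommissioning: %w", nodeID, ctx.Err())
		case <-ticker.C:
			done, err := nodeFullyDecommissioned(nodeID, certsDir, host)
			if err != nil {
				return err
			}
			if done {
				return nil
			}
		}
	}
}
```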
we didn't implement `cockroach start-single-node`, so I think we need to add some CR validation to make sure nodes >= 3 at all times, until we can go back and do...
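The validation itself is tiny; a sketch with stand-in CRD types (the real operator's type and field names may differ):

```go
package validation

import "fmt"

// Illustrative stand-ins for the operator's CRD types.
type CrdbClusterSpec struct{ Nodes int32 }
type CrdbCluster struct{ Spec CrdbClusterSpec }

const minClusterNodes = 3

// validateNodeCount would run from a validating admission webhook on
// create and update, rejecting any spec below 3 nodes.
func validateNodeCount(c *CrdbCluster) error {
	if c.Spec.Nodes < minClusterNodes {
		return fmt.Errorf("spec.nodes must be >= %d (got %d); fewer than 3 nodes is unsupported without `cockroach start-single-node`",
			minClusterNodes, c.Spec.Nodes)
	}
	return nil
}
```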