
`run bundle-upgrade` contains a race condition


Bug Report

What did you do?

Ran `run bundle-upgrade` using the Operator SDK CLI in a high-latency setup.

What did you expect to see?

The bundle upgrade working as expected.

What did you see instead? Under which circumstances?

The bundle did not upgrade because the new CSV did not appear. The new catalog was empty.

An investigation revealed that the right catalog pods were created and running. However, the pod containing the upgrade catalog had no YAML files under /configs.

The pod did report a mounted ConfigMap, though, and that ConfigMap contained the expected data.

Looking at the timestamps, the ConfigMap was created after the pod was started. This means the pod must have mounted the old ConfigMap created during `run bundle`. The deletion/replacement of that ConfigMap must then have led to the removal of the FBC file.

The race condition is thus between the replacement of the ConfigMap and the start of the new catalog pod.
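For illustration, here is a minimal client-go sketch of the timestamp comparison described above. This is a hypothetical diagnostic check, not part of operator-sdk; the namespace, pod, and ConfigMap names are placeholders, since the real names are generated by the SDK:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Placeholder names: the actual pod and ConfigMap names are
	// generated by operator-sdk.
	ns, podName, cmName := "default", "example-catalog", "example-catalog-configmap"

	pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), podName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	cm, err := client.CoreV1().ConfigMaps(ns).Get(context.TODO(), cmName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// If the ConfigMap was (re)created after the catalog pod started, the
	// pod mounted the old ConfigMap and the race described above was hit.
	if pod.Status.StartTime != nil && cm.CreationTimestamp.After(pod.Status.StartTime.Time) {
		fmt.Println("race hit: ConfigMap was replaced after the catalog pod started")
	} else {
		fmt.Println("timestamps look consistent")
	}
}
```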

Environment

Operator type:

/language go

Kubernetes cluster type:

A K3S cluster running via VCluster on GKE.

This is where the latency comes from. Resources are copied from the virtual cluster into GKE where the actual pods run.

$ operator-sdk version

operator-sdk version: "v1.22.0", commit: "9e95050a94577d1f4ecbaeb6c2755a9d2c231289", kubernetes version: "1.24.1", go version: "go1.18.3", GOOS: "linux", GOARCH: "amd64"

$ go version (if language is Go)

go version go1.18.3 linux/amd64

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-17T22:28:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.1500", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-11T09:25:37Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.24) and server (1.21) exceeds the supported minor version skew of +/-1

Possible Solution

Two ideas:

  • delete the old ConfigMap before creating the new catalog pod (my current workaround for the issue)
  • add a content-dependent suffix to the ConfigMap name so that ConfigMaps are not reused on upgrades (see the sketch below)
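A minimal sketch of the second idea, content-addressed ConfigMap naming; the function name, base name, and suffix length here are assumptions for illustration, not operator-sdk's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// configMapNameWithSuffix derives a ConfigMap name from its contents, so a
// changed FBC produces a ConfigMap under a new name instead of mutating the
// old one. A racing catalog pod can then never mount stale content.
func configMapNameWithSuffix(base string, data map[string]string) string {
	h := sha256.New()
	// Hash keys in a stable order so the suffix is deterministic.
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(data[k]))
	}
	return fmt.Sprintf("%s-%x", base, h.Sum(nil)[:4])
}

func main() {
	name := configMapNameWithSuffix("example-catalog", map[string]string{
		"catalog.yaml": "example FBC content",
	})
	fmt.Println(name) // base name plus an 8-hex-char content hash
}
```

This mirrors the approach of kustomize's configMapGenerator: because the pod spec references the hashed name, an upgrade creates a new ConfigMap and a new mount, and the old ConfigMap can be cleaned up afterwards.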

jeloba · Jul 06 '22 17:07

@jeloba We modified some logic recently that changes the FBC root dir from `/configs` to `fmt.Sprintf("/%s-configs", CatalogSource.Name)`, and this is where all the extra stuff for the FBC is supposed to end up.
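Illustratively, that quoted logic yields the following (the CatalogSource name below is a placeholder):

```go
package main

import "fmt"

func main() {
	// Placeholder CatalogSource name, for illustration only.
	catalogSourceName := "example-catalog"

	// The logic quoted above: the FBC root dir is derived from the
	// CatalogSource name.
	rootDir := fmt.Sprintf("/%s-configs", catalogSourceName)
	fmt.Println(rootDir) // prints "/example-catalog-configs"
}
```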

To be able to replicate this issue, could you please provide the bundle images that you ran `run bundle` and `run bundle-upgrade` with? Any relevant logs might also be helpful. Thanks @jeloba!

rashmigottipati · Jul 11 '22 18:07

The bundles and the operator are proprietary. I can probably provide the bundle but not the operator for testing. Will that be of any help?

As far as I see it, the behavior is not dependent on the specific catalog or bundle images but on the latency of the underlying cluster in realizing the changes to the ConfigMap.

@rashmigottipati which logs would be relevant? I don't remember seeing any error messages at all, except the timeout of `bundle-upgrade` while waiting for the CSV to become available.

jeloba · Jul 12 '22 06:07

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · Oct 17 '22 01:10

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · Nov 16 '22 08:11

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot · Dec 17 '22 00:12

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] · Dec 17 '22 00:12