helm-operator-plugins Allow marking releases stuck in a pending state as failed

PR to discuss how to recover from pending state. This is our current implementation used in https://github.com/stackrox/helm-operator

Fixes #94

Currently if the Helm releases exits unexpectedly (i.e. due to node crash or a bug) the pending state of a Helm release is never released and leads to an infinite reconciliation loop.

To automatically resolve pending states the reconciler now takes an option via WithMarkFailedAfter which configures a timeout after the pending state is handled as a failure.

ToDo

[x] Add Tests
[ ] Only mark failed on Operator owned Helm secrets

Nov 16 '21 15:11 SimonBaeumer

Pull Request Test Coverage Report for Build 1593318507

29 of 44 (65.91%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.5%) to 88.06%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/client/actionclient.go	0	7	0.0%
pkg/reconciler/reconciler.go	29	37	78.38%
<!--	Total:	29	44

Totals
Change from base Build 1579656302:	-0.5%
Covered Lines:	1652
Relevant Lines:	1876

💛 - Coveralls

Nov 16 '21 15:11 coveralls

@joelanford @varshaprasad96 @fabianvf On the only recover from operator owned secrets-topic. I think it does not make sense to limit the recovering of pending-states to operator owned Helm secrets because desired states the operator tries to match. In example looking at this workflow:

Install CR with operator
Helm release is installed by operator with CR values
Admin installs a release manually, values are updated and do not match desired state from the CR anymore
Reconciliation kicks in, compares the manifest diff and reconciles again
Admin changes reverted by operator

Imho if a user wants to do manually upgrade their release the operator should be disabled.

WDYTH on this?

Dec 17 '21 17:12 SimonBaeumer

@SimonBaeumer

Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
Create CR for the same release with operator
Should operator adopt release and attempt to reconcile?

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

Just block creation of the CR if a release secret already exists
Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

Dec 17 '21 17:12 joelanford

@SimonBaeumer

Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending

Create CR for the same release with operator

Should operator adopt release and attempt to reconcile?

I agree. An operator ideally should only reconcile resources which belong to them or which are explicitly labeled to allow it (opt-in to reconcile).

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

Just block creation of the CR if a release secret already exists

Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

I think adopting the release is reasonable if opted-in. This seems to be outside of the scope of this PR, created an issue for it: https://github.com/operator-framework/helm-operator-plugins/issues/144

Dec 20 '21 13:12 SimonBaeumer

@joelanford and @SimonBaeumer to pair on this

Feb 08 '22 16:02 asmacdo

@SimonBaeumer: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Mar 07 '22 06:03 openshift-ci[bot]

Closing as stale. Please re-open if necessary.

Sep 17 '24 11:09 perdasilva