helm-operator-plugins icon indicating copy to clipboard operation
helm-operator-plugins copied to clipboard

Allow marking releases stuck in a pending state as failed

Open SimonBaeumer opened this issue 4 years ago • 6 comments

PR to discuss how to recover from pending state. This is our current implementation used in https://github.com/stackrox/helm-operator

Fixes #94

Currently if the Helm releases exits unexpectedly (i.e. due to node crash or a bug) the pending state of a Helm release is never released and leads to an infinite reconciliation loop.

To automatically resolve pending states the reconciler now takes an option via WithMarkFailedAfter which configures a timeout after the pending state is handled as a failure.

ToDo

  • [x] Add Tests
  • [ ] Only mark failed on Operator owned Helm secrets

SimonBaeumer avatar Nov 16 '21 15:11 SimonBaeumer

Pull Request Test Coverage Report for Build 1593318507

  • 29 of 44 (65.91%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.5%) to 88.06%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/client/actionclient.go 0 7 0.0%
pkg/reconciler/reconciler.go 29 37 78.38%
<!-- Total: 29 44
Totals Coverage Status
Change from base Build 1579656302: -0.5%
Covered Lines: 1652
Relevant Lines: 1876

💛 - Coveralls

coveralls avatar Nov 16 '21 15:11 coveralls

@joelanford @varshaprasad96 @fabianvf On the only recover from operator owned secrets-topic. I think it does not make sense to limit the recovering of pending-states to operator owned Helm secrets because desired states the operator tries to match. In example looking at this workflow:

  1. Install CR with operator
  2. Helm release is installed by operator with CR values
  3. Admin installs a release manually, values are updated and do not match desired state from the CR anymore
  4. Reconciliation kicks in, compares the manifest diff and reconciles again
  5. Admin changes reverted by operator

Imho if a user wants to do manually upgrade their release the operator should be disabled.

WDYTH on this?

SimonBaeumer avatar Dec 17 '21 17:12 SimonBaeumer

@SimonBaeumer

Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

  1. Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
  2. Create CR for the same release with operator
  3. Should operator adopt release and attempt to reconcile?

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

  1. Just block creation of the CR if a release secret already exists
  2. Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

joelanford avatar Dec 17 '21 17:12 joelanford

@SimonBaeumer

Admin installs a release manually, values are updated and do not match desired state from the CR anymore

In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?

If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.

I was originally thinking of the reverse scenario:

  1. Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
  2. Create CR for the same release with operator
  3. Should operator adopt release and attempt to reconcile?

I agree. An operator ideally should only reconcile resources which belong to them or which are explicitly labeled to allow it (opt-in to reconcile).

I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:

  1. Just block creation of the CR if a release secret already exists
  2. Adopt the release

Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.

I think adopting the release is reasonable if opted-in. This seems to be outside of the scope of this PR, created an issue for it: https://github.com/operator-framework/helm-operator-plugins/issues/144

SimonBaeumer avatar Dec 20 '21 13:12 SimonBaeumer

@joelanford and @SimonBaeumer to pair on this

asmacdo avatar Feb 08 '22 16:02 asmacdo

@SimonBaeumer: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Mar 07 '22 06:03 openshift-ci[bot]

Closing as stale. Please re-open if necessary.

perdasilva avatar Sep 17 '24 11:09 perdasilva