Allow marking releases stuck in a pending state as failed
PR to discuss how to recover from pending state. This is our current implementation used in https://github.com/stackrox/helm-operator
Fixes #94
Currently if the Helm releases exits unexpectedly (i.e. due to node crash or a bug) the pending state of a Helm release is never released and leads to an infinite reconciliation loop.
To automatically resolve pending states the reconciler now takes an option via WithMarkFailedAfter which configures a timeout after the pending state is handled as a failure.
ToDo
- [x] Add Tests
- [ ] Only mark failed on Operator owned Helm secrets
Pull Request Test Coverage Report for Build 1593318507
- 29 of 44 (65.91%) changed or added relevant lines in 2 files are covered.
- No unchanged relevant lines lost coverage.
- Overall coverage decreased (-0.5%) to 88.06%
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/client/actionclient.go | 0 | 7 | 0.0% |
| pkg/reconciler/reconciler.go | 29 | 37 | 78.38% |
| <!-- | Total: | 29 | 44 |
| Totals | |
|---|---|
| Change from base Build 1579656302: | -0.5% |
| Covered Lines: | 1652 |
| Relevant Lines: | 1876 |
💛 - Coveralls
@joelanford @varshaprasad96 @fabianvf On the only recover from operator owned secrets-topic.
I think it does not make sense to limit the recovering of pending-states to operator owned Helm secrets because desired states the operator tries to match.
In example looking at this workflow:
- Install CR with operator
- Helm release is installed by operator with CR values
- Admin installs a release manually, values are updated and do not match desired state from the CR anymore
- Reconciliation kicks in, compares the manifest diff and reconciles again
- Admin changes reverted by operator
Imho if a user wants to do manually upgrade their release the operator should be disabled.
WDYTH on this?
@SimonBaeumer
Admin installs a release manually, values are updated and do not match desired state from the CR anymore
In this step, you're saying an admin runs helm upgrade on a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?
If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.
I was originally thinking of the reverse scenario:
- Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
- Create CR for the same release with operator
- Should operator adopt release and attempt to reconcile?
I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:
- Just block creation of the CR if a release secret already exists
- Adopt the release
Both of these options align with your scenario because the custom object would exist prior to the helm upgrade call. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.
@SimonBaeumer
Admin installs a release manually, values are updated and do not match desired state from the CR anymore
In this step, you're saying an admin runs
helm upgradeon a release that is already being managed by a CR? And then that upgrade gets stuck in the pending state?If so, I think your scenario makes sense, and I agree that the operator should reconcile back (by performing another upgrade) to the desired state as specified by the CR.
I was originally thinking of the reverse scenario:
- Admin installs a release manually, this release (or a subsequent upgrade) gets stuck in pending
- Create CR for the same release with operator
- Should operator adopt release and attempt to reconcile?
I agree. An operator ideally should only reconcile resources which belong to them or which are explicitly labeled to allow it (opt-in to reconcile).
I was originally arguing that "no, it should not", but the more I think about it, I think I'm changing my mind. I could see 2 options for my scenario:
- Just block creation of the CR if a release secret already exists
- Adopt the release
Both of these options align with your scenario because the custom object would exist prior to the
helm upgradecall. Theoretically both of these are valid depending on your perspective, but practically I think option 2 is easier because it doesn't involve building and shipping an admission webhook on behalf of a helm-operator author, which seems like it could be pretty difficult to orchestrate.
I think adopting the release is reasonable if opted-in. This seems to be outside of the scope of this PR, created an issue for it: https://github.com/operator-framework/helm-operator-plugins/issues/144
@joelanford and @SimonBaeumer to pair on this
@SimonBaeumer: PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing as stale. Please re-open if necessary.