[epic] Add retry limits and a new reason to the `Progressing` status condition that is present when the retry limit is reached.
As part of the v1 API stabilization effort, a few of us (@joelanford @grokspawn @LalatenduMohanty and I) identified that it would be a better user experience if the `Progressing` condition, when set to `True`, was considered a "happy" state where generally everything is okay and we are either actively attempting to progress towards a future state or are at the desired state and ready to progress towards any future desired states. We felt that `Retrying` is currently a mix of a "sad" and a "happy" reason, but is generally more reflective of "we encountered a hiccup in progressing" than of something requiring user intervention.
In an effort to make the `Retrying` reason be considered "happier", we concluded that it would be best to add some form of retry limit. When that limit is reached, the `Progressing` condition would be set to `False` with a reason along the lines of `RetryLimitExceeded`, signaling to users that the `ClusterExtension`/`ClusterCatalog` is no longer in a "happy" state.
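To make it concrete, here's a rough sketch of what the controller-side logic could look like (the `retryCount`/`maxRetries` plumbing and the messages are illustrative, not actual operator-controller code):

```go
package conditions

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setProgressing keeps Progressing=True, Retrying while attempts remain,
// and flips to Progressing=False, RetryLimitExceeded once the limit is hit.
func setProgressing(conds *[]metav1.Condition, gen int64, retryCount, maxRetries int, err error) {
	cond := metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            fmt.Sprintf("attempt %d of %d failed: %v", retryCount, maxRetries, err),
		ObservedGeneration: gen,
	}
	if retryCount >= maxRetries {
		// Retry budget exhausted: surface the "sad" state explicitly.
		cond.Status = metav1.ConditionFalse
		cond.Reason = "RetryLimitExceeded"
		cond.Message = fmt.Sprintf("giving up after %d attempts: %v", maxRetries, err)
	}
	meta.SetStatusCondition(conds, cond)
}
```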
Looking at the `Progressing` condition from a user perspective, I'd expect it to be `True` when reconciliation is going on and we're trying to transition to a new desired state, and `False` otherwise. I am not sure `True` should signify anything beyond that simple state.
Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between a recoverable error and a non-recoverable one. That seems to be domain knowledge that only the specific controller driving the transition would actually have.
This may be naive, but what if we introduced a new condition, something like `ProgressingError`?
In case of an upgrade error, our status might look like:

```yaml
- lastTransitionTime: <time>
  message: <detailed error message>
  observedGeneration: 2
  reason: Upgrade # or Rollback etc. - the action which is failing
  status: "True"
  type: ProgressingError
- lastTransitionTime: <time>
  message: "Upgrading x to version x.x.x"
  observedGeneration: 2
  reason: Retrying
  status: "True"
  type: Progressing
```
With this I think we would be able to:
- clearly communicate state in a predictable way
- have a straightforward means of detecting that something is wrong, which is the main issue here
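To sketch it (purely illustrative, using apimachinery's `meta.SetStatusCondition` helper; the `ProgressingError` type and `Upgrade` reason are the proposal above, not an existing API), the controller-side write could look like:

```go
package conditions

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reportUpgradeFailure records an upgrade hiccup under the proposed scheme:
// Progressing stays True (we are still retrying), while ProgressingError
// carries the failing action and the detailed error.
func reportUpgradeFailure(conds *[]metav1.Condition, gen int64, err error) {
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            "Upgrading x to version x.x.x",
		ObservedGeneration: gen,
	})
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "ProgressingError",
		Status:             metav1.ConditionTrue,
		Reason:             "Upgrade", // or Rollback etc. - the action which is failing
		Message:            err.Error(),
		ObservedGeneration: gen,
	})
}
```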
WDYT?
@everettraven @joelanford @grokspawn @LalatenduMohanty
> Looking at the `Progressing` condition from a user perspective, I'd expect it to be `True` when reconciliation is going on and we're trying to transition to a new desired state, and `False` otherwise. I am not sure `True` should signify anything beyond that simple state.
Depends on the user :). The `Deployment` API in Kubernetes sets `Progressing` to `True` even when the rollout is completed: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment
The idea of following this pattern was so that tooling like `kubectl wait --for=Condition=Status` could be used to check if a `ClusterExtension` or `ClusterCatalog` is in a good or bad state. If you overload `Progressing=False` as both a happy and a sad state, it makes it harder to use something like `kubectl wait ...` for an evaluation because you then have to look at the reason. So `Progressing=True` as a catch-all for the happy state and `Progressing=False` as a catch-all for the sad state enables user workflows that use a quick glance of Condition+Status to know if something is generally OK or needs more attention.
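For example, a user could run something like `kubectl wait clusterextensions/<name> --for=condition=Progressing=True --timeout=60s` (hypothetical object name). A rough Go sketch of the same glance-vs-reason distinction, using apimachinery's condition helpers (`Succeeded` is a made-up reason for the overloaded variant):

```go
package conditions

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// With the proposed split, one Type+Status glance answers
// "is this object generally OK?".
func isHappy(conds []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conds, "Progressing")
}

// If Progressing=False were overloaded as both happy and sad, callers
// would also have to enumerate the "happy" reasons (Succeeded here is
// purely illustrative).
func isHappyOverloaded(conds []metav1.Condition) bool {
	c := meta.FindStatusCondition(conds, "Progressing")
	if c == nil {
		return false
	}
	return c.Status == metav1.ConditionTrue || c.Reason == "Succeeded"
}
```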
> Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between a recoverable error and a non-recoverable one. That seems to be domain knowledge that only the specific controller driving the transition would actually have.
I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could always consider a progression deadline: if you are in a state of `Progressing=True, Retrying` for a duration that exceeds the deadline, you could go to `Progressing=False, ProgressionDeadlineExceeded`.
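A minimal sketch of that deadline check, assuming a hypothetical fixed deadline value and treating the condition's `lastTransitionTime` as the "retrying since" marker (apimachinery only bumps it on status flips, so this is an approximation):

```go
package conditions

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// checkProgressionDeadline flips Progressing to False once the object
// has been stuck in a retry loop for longer than the deadline.
func checkProgressionDeadline(conds *[]metav1.Condition, gen int64, deadline time.Duration) {
	c := meta.FindStatusCondition(*conds, "Progressing")
	if c == nil || c.Status != metav1.ConditionTrue || c.Reason != "Retrying" {
		return // only time out objects that are actively retrying
	}
	if time.Since(c.LastTransitionTime.Time) > deadline {
		meta.SetStatusCondition(conds, metav1.Condition{
			Type:               "Progressing",
			Status:             metav1.ConditionFalse,
			Reason:             "ProgressionDeadlineExceeded",
			Message:            fmt.Sprintf("no progress for more than %s", deadline),
			ObservedGeneration: gen,
		})
	}
}
```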
> This may be naive, but what if we introduced a new condition, something like `ProgressingError`?
Why not just use the existing `Progressing` condition's `message` field to show the error?
> I am not sure we'd be able to clearly differentiate between a recoverable error and a non-recoverable one.
IIRC, if an error is not recoverable, we can immediately set `Progressing=False` with the error. If the error is recoverable, then we wouldn't set `Progressing=False` until `spec.progressDeadlineSeconds` expires.
Here's the `deployment.spec.progressDeadlineSeconds` API doc:
> `progressDeadlineSeconds`
>
> The maximum time in seconds for a deployment to make progress before it is considered to be failed. The deployment controller will continue to process failed deployments and a condition with a ProgressDeadlineExceeded reason will be surfaced in the deployment status. Note that progress will not be estimated during the time a deployment is paused. Defaults to 600s.
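Sketched in Go, that split might look like the following; the `terminalError` type and the `Failed` reason are stand-ins for whatever classification operator-controller/catalogd already use, not their actual API:

```go
package conditions

import (
	"errors"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// terminalError is a hypothetical marker for non-recoverable failures.
type terminalError struct{ error }

func onReconcileError(conds *[]metav1.Condition, gen int64, err error) {
	var t terminalError
	if errors.As(err, &t) {
		// Non-recoverable: go to Progressing=False immediately.
		meta.SetStatusCondition(conds, metav1.Condition{
			Type:               "Progressing",
			Status:             metav1.ConditionFalse,
			Reason:             "Failed",
			Message:            err.Error(),
			ObservedGeneration: gen,
		})
		return
	}
	// Recoverable: stay in the happy Retrying state; a separate deadline
	// check (as sketched earlier) handles the "stuck too long" case.
	meta.SetStatusCondition(conds, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            err.Error(),
		ObservedGeneration: gen,
	})
}
```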
> > Looking at the `Progressing` condition from a user perspective, I'd expect it to be `True` when reconciliation is going on and we're trying to transition to a new desired state, and `False` otherwise. I am not sure `True` should signify anything beyond that simple state.
>
> Depends on the user :). The `Deployment` API in Kubernetes sets `Progressing` to `True` even when the rollout is completed: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment
That's interesting, I did not know this was how state signaling worked in Deployments. It's a little unintuitive to me (`Progressing` being `True` could mean you are ready to progress rather than actually progressing ;)), but I guess once you know that's how it works, it makes more sense.
> The idea of following this pattern was so that tooling like `kubectl wait --for=Condition=Status` could be used to check if a `ClusterExtension` or `ClusterCatalog` is in a good or bad state. If you overload `Progressing=False` as both a happy and a sad state, it makes it harder to use something like `kubectl wait ...` for an evaluation because you then have to look at the reason. So `Progressing=True` as a catch-all for the happy state and `Progressing=False` as a catch-all for the sad state enables user workflows that use a quick glance of Condition+Status to know if something is generally OK or needs more attention.
Yes, I wasn't questioning the idea of moving away from overloading `Progressing`, just thinking about an alternative and potentially more intuitive way to approach it.
> > Similarly, setting a retry limit might prove tricky, because the issue preventing the extension from transitioning to a new state might be removed at an arbitrary time, and I am not sure we'd be able to clearly differentiate between a recoverable error and a non-recoverable one. That seems to be domain knowledge that only the specific controller driving the transition would actually have.
>
> I believe there is existing logic within both operator-controller and catalogd related to "terminal" (non-recoverable) and recoverable errors. If a retry limit is too tricky to implement, you could always consider a progression deadline: if you are in a state of `Progressing=True, Retrying` for a duration that exceeds the deadline, you could go to `Progressing=False, ProgressionDeadlineExceeded`.
My thinking was that while an error might persist for some time, the issue causing it might be removed externally at any point, which could let the reconciliation progress to completion. So would the idea here be that even though we'd set `Progressing=False`, conveying that this is a 'terminal' error, we'd still keep retrying and reconciling in the controller?
> > This may be naive, but what if we introduced a new condition, something like `ProgressingError`?
>
> Why not just use the existing `Progressing` condition's `message` field to show the error?
This is exactly how it works now, and it means `Progressing` is overloaded with good/bad state, so you have to look for that error in `message` rather than having a clear way of knowing there's an issue. With the new proposed condition you'd have a clear (and IMO intuitive) separation, and you'd be able to convey additional useful information, like which action you're progressing on (upgrade, installation, etc.).