
Don't instantly fail if redeploying a Deployment past PDS

Open timothysmith0609 opened this issue 6 years ago • 1 comment

Proposal:

Currently, if someone attempts to redeploy a Deployment that is in the progressDeadlineExceeded state, we instantly fail the deploy. This leads users to think their deploy is fundamentally broken and will never succeed. In most cases, the error is simply indicative of slow cluster operations, and the deploy will (eventually) succeed. I propose it would be a better user experience if we handled this case a little more gracefully.
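For context, this state is surfaced on the Deployment's status conditions: the Progressing condition flips to status "False" with reason ProgressDeadlineExceeded once the deadline passes. A minimal check for that state might look like the following Ruby sketch (illustrative only, not krane's actual code):

```ruby
# Illustrative sketch, not krane's implementation: detect the
# progressDeadlineExceeded state from a Deployment's status hash
# (e.g. parsed from `kubectl get deployment <name> -o json`).
def progress_deadline_exceeded?(status)
  progressing = status.fetch("conditions", []).find { |c| c["type"] == "Progressing" }
  return false unless progressing

  progressing["status"] == "False" && progressing["reason"] == "ProgressDeadlineExceeded"
end
```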

Alternative 1

If applying a Deployment results in a change to its spec, the status field is cleared and the Progressing condition is reset to true. On redeploy, if we can see that the apply is a no-op and that progressing == false, we know we are in a situation where we are trying to rewatch an in-flight deployment. Coding this has several issues:

  • There is no 'lastUpdated' field for the most recent change -> we don't know if we're seeing a new deploy.
  • It's difficult to initially observe the Deployment resource to know if it is already in progressDeadlineExceeded, and also difficult to pass that information nicely to the resource object being watched.

One approach could be to compare deploy_started_at against progressDeadlineSeconds: if the Deployment is reporting progressDeadlineExceeded before at least progressDeadlineSeconds have elapsed since deploy_started_at, the condition must be stale, and we know we're in the case where we're watching a redeploy.
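A rough sketch of that comparison (deploy_started_at and progress_deadline_seconds are illustrative stand-ins here, not necessarily krane's real internals):

```ruby
# Illustrative sketch, not krane's implementation. Treat a ProgressDeadlineExceeded
# condition as stale, i.e. assume we are rewatching a redeploy, if fewer than
# progressDeadlineSeconds have elapsed since the current deploy started.
def stale_exceeded_condition?(progressing_condition, deploy_started_at, progress_deadline_seconds)
  exceeded = progressing_condition &&
    progressing_condition["status"] == "False" &&
    progressing_condition["reason"] == "ProgressDeadlineExceeded"
  return false unless exceeded

  (Time.now - deploy_started_at) < progress_deadline_seconds
end
```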

Alternative 2

Instead of failing the deploy when progressDeadlineExceeded is reported, we could issue a warning but continue to watch. If we exceed the global timeout, we then raise a fatal deployment error as normal.
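A hedged sketch of what that could look like in a watch loop (the method signature and helpers here are hypothetical, not krane's actual API; it reuses the progress_deadline_exceeded? check sketched above):

```ruby
# Hypothetical watch loop for alternative 2, not krane's actual code.
# The caller supplies callables for re-reading the Deployment status and for
# deciding whether the rollout is complete.
def watch_deployment(fetch_status:, rollout_complete:, logger:, global_timeout_seconds:, poll_interval: 5)
  started_at = Time.now
  warned = false

  loop do
    status = fetch_status.call
    return :success if rollout_complete.call(status)

    if progress_deadline_exceeded?(status) && !warned
      logger.warn("Deployment exceeded progressDeadlineSeconds; continuing to watch")
      warned = true
    end

    if Time.now - started_at > global_timeout_seconds
      raise "global timeout exceeded while watching Deployment"
    end

    sleep(poll_interval)
  end
end
```

With this approach the global timeout becomes the only hard failure, which is the main objection raised in the reply below.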

timothysmith0609 avatar Nov 11 '19 22:11 timothysmith0609

To clarify, when you say fail, do you really mean "time out"? We should be timing out deploys based on PDS, not failing them.

On the one hand, I think that it makes sense for us to give the same result if you instantly mash retry and nothing has changed in the system. On the other hand, I'm sympathetic to the view that timing out instantly is semantically confusing. I'm open to alternative 1, i.e. make deploy_failing_to_progress? also compare deploy_started_at to the PDS (see the sketch after this list). I would not consider alternative 2, for several reasons:

  • Warnings are never seen when krane is invoked by a higher-level CD system (as is typical).
  • PDS is configured by the developer and is the canonical way for Deployment owners to signal to k8s systems (including krane) that some irregularity is causing convergence to fail.
  • The global timeout exists to protect CD systems from running krane forever, not as a developer feedback mechanism. It may be set very high in some cases, or even left unset.
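For concreteness, that change to deploy_failing_to_progress? could look roughly like this (a sketch against illustrative accessors, not the actual krane source):

```ruby
# Sketch only: how deploy_failing_to_progress? might fold in the PDS comparison.
# deploy_started?, progressing_condition, deploy_started_at and
# progress_deadline_seconds are illustrative stand-ins, not krane's real accessors.
def deploy_failing_to_progress?
  return false unless deploy_started?
  return false unless progressing_condition &&
    progressing_condition["status"] == "False" &&
    progressing_condition["reason"] == "ProgressDeadlineExceeded"

  # Only count the exceeded condition against the current deploy once the
  # progress deadline has actually elapsed since this deploy started.
  Time.now - deploy_started_at >= progress_deadline_seconds
end
```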

A related thought is that having a separate watch command would help this situation. I dunno how we'd hook it up in Shipit, but re-watch is really what the user is trying to do in this case, not re-deploy. cc @RyanBrushett because you were prototyping this.

KnVerey avatar Nov 27 '19 18:11 KnVerey