Add scenarios for executing rollbacks.
Describe the feature
What problem are you trying to solve? Previously, Flagger would trigger a rollback only when the analysis failed a certain number of times. In practice, this trigger does not cover enough scenarios. I think a rollback should be performed whenever the expected state does not match the actual state. During analysis, reaching a certain number of failures is one case where the expected state does not match the actual state. After the analysis passes, if the actual cluster state does not match expectations for a period of time or after a few checks, a rollback should also be triggered. Currently, Flagger lacks rollback triggers covering the phase from passing analysis to the final release in the cluster.
Proposed solution
What do you want to happen? Add any considered drawbacks. Add rollback checks to the phase between passing analysis and completing the deployment in the cluster, comparing the actual cluster state against the expected state to determine whether to trigger a rollback.
Any alternatives you've considered?
Is there another way to solve this problem that isn't as good a solution? I do not have an alternative solution right now. Maybe a better approach can be found through community discussion.
by comparing the cluster status and expected status
could you provide an example of what you mean by this?
For example, suppose a deployment/podinfo whose image version was changed passes the canary analysis. According to the normal release process, Flagger then promotes the podinfo-canary spec to podinfo-primary and gradually/fully migrates traffic to podinfo-primary to complete the release. At that point, the expected cluster state is that the specified number of podinfo-primary replicas are running the target image version.
But if, due to cluster resource constraints or other reasons, the image upgrade of podinfo-primary fails or the replicas cannot reach the expected number, Flagger remains stuck waiting for podinfo-primary to become ready. That is the actual state of the cluster. #1591
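To make the expected-versus-actual comparison concrete, here is a minimal sketch using the standard Kubernetes appsv1 types; the helper name and the exact fields it checks are illustrative assumptions, not existing Flagger code:

```go
package rollbackcheck

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// primaryMatchesExpectation is a hypothetical helper: it reports whether the
// observed primary Deployment has converged to the state expected after
// promotion (target image rolled out, desired number of replicas ready).
func primaryMatchesExpectation(observed *appsv1.Deployment, targetImage string, desiredReplicas int32) bool {
	// Expected state: the pod template runs the promoted image.
	if len(observed.Spec.Template.Spec.Containers) == 0 ||
		observed.Spec.Template.Spec.Containers[0].Image != targetImage {
		return false
	}
	// Actual state: all desired replicas have been updated and are ready.
	if observed.Status.UpdatedReplicas < desiredReplicas ||
		observed.Status.ReadyReplicas < desiredReplicas {
		return false
	}
	// A Progressing=False condition (e.g. ProgressDeadlineExceeded) also
	// indicates the rollout is stuck.
	for _, c := range observed.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing && c.Status == corev1.ConditionFalse {
			return false
		}
	}
	return true
}
```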
The solution I came up with is to create a resource object that stores the expected state of deployment/podinfo, and to use this object to create podinfo-canary. If the analysis passes, Flagger needs to check whether the podinfo-primary fetched from the cluster via client.Get() is fully consistent with that resource object. If the inconsistency persists across several checks, it means the rollout failed and a rollback should be performed.
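A rough sketch of how such a check could be wired up with a controller-runtime client follows; checkPrimaryAfterPromotion and the stored expected Deployment are assumptions for illustration, not Flagger's actual implementation:

```go
package rollbackcheck

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// checkPrimaryAfterPromotion is a hypothetical post-promotion check: it fetches
// the live primary Deployment and compares it with the stored expected state.
// A returned error would signal the caller to trigger a rollback, for example
// once the mismatch has persisted across several checks.
func checkPrimaryAfterPromotion(ctx context.Context, c client.Client, expected *appsv1.Deployment) error {
	observed := &appsv1.Deployment{}
	key := types.NamespacedName{Namespace: expected.Namespace, Name: expected.Name}
	if err := c.Get(ctx, key, observed); err != nil {
		return fmt.Errorf("getting primary %s: %w", key, err)
	}

	// Compare the parts of the spec that promotion should have applied,
	// e.g. the containers in the pod template, and the replica count.
	if !apiequality.Semantic.DeepEqual(expected.Spec.Template.Spec.Containers, observed.Spec.Template.Spec.Containers) {
		return fmt.Errorf("primary %s pod template does not match the expected state", key)
	}
	if expected.Spec.Replicas != nil && observed.Status.ReadyReplicas < *expected.Spec.Replicas {
		return fmt.Errorf("primary %s has %d ready replicas, expected %d",
			key, observed.Status.ReadyReplicas, *expected.Spec.Replicas)
	}
	return nil
}
```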
Here is a preview image that contains a fix for this bug: ghcr.io/fluxcd/flagger:rc-bb949c08. We would appreciate it if you could try it and confirm whether the fix works. Thanks! :)
I have verified it, and it covers my usage scenario. Thank you for your hard work. @aryan9600