Rerun commands after failure
What would you like to be added
The ability to rerun CLI commands after an error.
Why is this needed
Most of the errors returned by the CLI cluster commands are either transient or fixable with manual intervention. However, the CLI commands are not idempotent. This makes it impossible to rerun commands after they have failed, even if the root cause of the issue has already been solved.
This makes the experience very disruptive, especially for the upgrade cluster command, which can leave the cluster in an irrecoverable state without the necessary manual cleanup. This "cleanup" is not documented and depends heavily on the CLI's internal implementation. In certain scenarios it is not even possible, requiring users to destroy their clusters and recreate them from scratch.
Solution
The ideal solution for this problem is to move most of the CLI logic to a Kubernetes controller and run it in the bootstrap (or management) cluster. The reconciliation loop is the perfect pattern for this scenario. The ongoing work to offer full cluster lifecycle APIs pushes us in this direction. However, it will require a significant amount of time to complete, and it will not even be enough on its own (it is not scoped to include running the controller from a bootstrap cluster).
For a CLI-centric implementation, making the commands idempotent would be the ideal solution. This presents multiple challenges, like the dependency on external binary operations that might not be idempotent, or the difficulty of moving objects back to a cluster with an unreachable control plane. Unfortunately, this solution would probably take even more time than the previous one.
A third option, less robust but significantly simpler, would be to implement a "checkpoint" capability. The CLI would keep a record of all completed steps for a workflow, along with the data necessary to restore its state. When a command is rerun after an error, the CLI would skip the completed steps, restoring the state of the program. This requires each command step to be idempotent within itself: if a step fails mid-execution, we need to be able to rerun it from the beginning. If one of the steps breaks this rule, it will need to be split into two or more steps.
Proposal
I propose implementing the third option: using a checkpoint. The checkpoint data would be stored as a file on disk, and the idempotent steps would be the Tasks in the command workflows.
A POC proving this idea can be found here. For a codified example of the proposed user flow, refer to this E2E test.
This is an example of the checkpoint file:
```yaml
completedTasks:
  bootstrap-cluster-init:
    ExistingManagement: false
    KubeconfigFile: m-docker/generated/m-docker.kind.kubeconfig
    Name: m-docker
  capi-management-move-to-bootstrap: null
  ensure-etcd-capi-components-exist: null
  install-capi: null
  pause-controllers-reconcile: null
  setup-and-validate: null
  update-secrets: null
  upgrade-core-components:
    components:
    - name: EKS-D
      newVersion: kubernetes-1-21-eks-12
      oldVersion: kubernetes-1-20-eks-14
  upgrade-needed:
    Needed: true
```
The completed tasks are stored as an object in a YAML file (a `map[string]interface{}` in Go code), with the task names as the keys and the state data as the values.
Implementation notes
This issue is only intended to describe the problem and propose an idea on how to solve it. A proper design doc should be presented to the team before implementing it (probably not more than a page). If a better idea or a variation of the one proposed here is found during this design phase, please go ahead with it.
Take into account that the POC's only purpose was to prove the validity of this idea, not to serve as an implementation reference. We should find the best way to implement this idea, and the POC is far from that. DO NOT use the POC's code or checkpoint file format as a design reference.
Tasks
- [ ] #2702
- [x] #2703
- [ ] Add checkpoint feature to create command
- [ ] E2E tests for create with checkpoint
- [ ] Update docs