Operator should help autorecover in common deploy failure states

Open sync-by-unito[bot] opened this issue 3 years ago • 1 comment

We have noticed several common Vizier failure states while observing unhealthy user clusters. Many of these can occur at any point in a cluster's lifetime (for example, high load causing evictions) and leave Vizier in an unhealthy state. Currently, the only remediation for these clusters is to manually clobber and redeploy. Since the operator is already running in the cluster, it should be able to detect these situations and redeploy the necessary dependencies when needed.

Tasks:

[ ] Operator should detect when NATS is in a bad state

[ ] Operator should restart NATS when it is in a bad state

[ ] Operator should detect when etcd is in a bad state

[ ] Operator should restart etcd when it is in a bad state

[ ] Operator should detect when PV fails to mount and fall back to etcd

┆Issue is synchronized with this Jira Story by Unito

Jul 12 '22 05:07 sync-by-unito[bot]

I can't edit the description, so I'm tracking the progress in this comment:

[x] Operator should detect when NATS is in a bad state
[x] Operator should restart NATS when it is in a bad state
[x] Operator should detect when etcd is in a bad state
[x] Operator should restart etcd when it is in a bad state
[x] Operator should detect when PV fails to mount and fall back to etcd

Aug 18 '22 20:08 philkuz