[Epic] Graceful restarts of Timely (i.e., `computed`) clusters
We may spin up Timely processes to replace crashed ones while processes from a previous incarnation are still alive. In that case, the new processes will soon crash as well, and we will back off in the controller. We consider that acceptable for now. However, there may be some rare edge cases where we get into a permanently wedged state or a crash loop.
Currently, I believe that those edge cases are very unlikely to happen (e.g., only if there is a second unrelated crash during the initial bringup). However, we should still fix them.
Goals
Acceptance Criteria
After a Timely cluster crashes, assuming any computations that caused it to OOM or otherwise crash have been dropped, it will come back up in a reasonable amount of time with 100% probability.
Tasks
- Design a solution, and demonstrate its correctness with a detailed argument (not just trying a series of things to see if they work)
- Implement the solution
Testing
Bugs
Related
- #12946 is superficially connected: (1) a lot of people confuse the two issues, and (2) the two manifest in similar ways and have therefore been reported as the same issue in the past.
Questions
> However, there may be some rare edge cases where we get into a permanently wedged state or crash loop.
>
> Currently, I believe that those edge cases are very unlikely to happen (e.g., only if there is a second unrelated crash during the initial bringup). However, we should still fix them.
What are the edge cases? I know you said this is different from #12946, but I'm not clear on what those differences are!
This problem still exists. It's been preempted by another task for the moment.
I'm still unclear on what the edge cases are here that are different from #12946!
We aren't clear either, that's the problem. Someone needs to spend a few hours (or longer if necessary) thinking hard about this and either write a formal justification for the claim that clusters always converge to a good state (given enough time between failures), or identify the circumstances under which they don't.
Example of the general flavor of issue: https://github.com/MaterializeInc/materialize/issues/12800
IIRC there are also possible states where we hang indefinitely rather than crash looping. I'll describe one here if I can remember or find it.
I'd personally be inclined to optimistically close this and #12800. We haven't seen any failures of this sort since we killed linkerd, right? I suspect that we can rely on TCP connections failing much faster than Kubernetes's maximum backoff, and that should make this all work out.
Somewhat relatedly, I put up https://github.com/MaterializeInc/materialize/pull/14162 to remove the linkerd workaround from the timely networking establishment.
I don't think this can be closed yet. It is conceptually distinct from issues of the form "TCP connections can randomly go down". Even if we consider such issues to be out-of-scope and less likely due to having removed linkerd, I believe the class of issues I described in #12800 can happen due to issues that still are part of our threat model, and have nothing to do with random TCP flakiness, like a pod going down because the node it was on failed.
It's not urgent because it's rare enough that SREs can manually kill and reboot clusters if they're hanging or crash looping. But it is still an issue, unless I made a mistake in my concrete example of how a problem can arise that I described in #12800, or unless we disagree about which classes of failures are in scope.
We now have a demonstration that this is not rare! See #14301.
This seems like it could/should be fixable with a `Uuid` (or maybe a `SeqNo`) from the controller as part of replica initialization. Each instance should only form a mesh with other instances with the same identifier, rather than bonding to prior-generation instances. I think this was maybe less clear before replica rehydration was "drop replica; add replica", but it seems like it should be more straightforward now.
If this is wrong, or misses some important part of this issue, we should write down what I've missed, so that I don't keep thinking this would just solve it. :D
No, I think that works! I think you need worker processes to crash if, when making or receiving a connection, they realize the other end knows about a newer epoch. But otherwise it should work.
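A minimal sketch of what that handshake could look like, assuming a plain TCP exchange outside the real Timely/`computed` networking code; the names (`Epoch`, `exchange_epochs`, `HandshakeOutcome`) are illustrative, and a `SeqNo`-style `u64` stands in for the `Uuid`:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Incarnation identifier handed to every process of a replica by the
/// controller (a sequence number here; a Uuid would work the same way).
type Epoch = u64;

#[derive(Debug)]
enum HandshakeOutcome {
    /// Peer is from the same incarnation: keep the connection.
    Accept(TcpStream),
    /// Peer is from an older incarnation: drop it and wait for a newer peer.
    RejectStale,
}

fn exchange_epochs(mut stream: TcpStream, my_epoch: Epoch) -> std::io::Result<HandshakeOutcome> {
    // Send our epoch, then read the peer's.
    stream.write_all(&my_epoch.to_be_bytes())?;
    let mut buf = [0u8; 8];
    stream.read_exact(&mut buf)?;
    let peer_epoch = Epoch::from_be_bytes(buf);

    if peer_epoch == my_epoch {
        Ok(HandshakeOutcome::Accept(stream))
    } else if peer_epoch > my_epoch {
        // The controller has already moved on to a newer incarnation, so this
        // process is stale; crash and let the orchestrator replace it.
        panic!("peer is at epoch {peer_epoch}, we are at {my_epoch}");
    } else {
        // The peer is a leftover from a previous incarnation; just drop it.
        Ok(HandshakeOutcome::RejectStale)
    }
}
```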
Initially the processes have no state, and thus it does not make sense to distinguish between various incarnations of them. For example, if process 1 is connected to process 2 and the latter goes down, there is no good reason for process 1 to conclude "my whole mesh is hosed, I need to panic!" It can just drop its connection to (the old incarnation of) process 2, wait for a new one to come up, and then connect to that one.
So I think the fundamental fix needed here is to not block on all the connections coming up before we notice that one of them has dropped. We should instead continually monitor all connections during startup, and if one goes down, drop it and let it be re-established (i.e., if it's one we initiated, initiate it again; if it's one we are waiting to accept from a peer, allow that peer to reconnect).
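To make the shape of that concrete, here is a rough sketch of the monitoring loop, not the actual Timely networking code; `connect_to_peer`, `is_alive`, and `maintain_mesh` are made-up names, the liveness check is only a placeholder, and it shows only the connections we initiate (ignoring the accepting side):

```rust
use std::collections::HashMap;
use std::io;
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

fn connect_to_peer(addr: &str) -> io::Result<TcpStream> {
    TcpStream::connect(addr)
}

fn is_alive(stream: &TcpStream) -> bool {
    // Placeholder: a real implementation would notice EOF or errors on the
    // stream itself rather than asking for the peer address.
    stream.peer_addr().is_ok()
}

fn maintain_mesh(peers: &[&str]) {
    let mut conns: HashMap<String, TcpStream> = HashMap::new();
    loop {
        // Drop any connection that has gone down so it can be re-made below.
        conns.retain(|_, s| is_alive(s));

        // (Re-)establish every missing connection; a failure is retried on
        // the next pass instead of wedging or aborting startup.
        for &addr in peers {
            if !conns.contains_key(addr) {
                if let Ok(s) = connect_to_peer(addr) {
                    conns.insert(addr.to_string(), s);
                }
            }
        }

        if conns.len() == peers.len() {
            // Fully meshed; a real implementation would hand the connections
            // off to the worker threads here while continuing to monitor.
        }
        thread::sleep(Duration::from_millis(500));
    }
}
```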
All issues that I know about are now fixed. Leaving this open pending QA approval.