[Epic] Graceful restarts of Timely (i.e., `computed`) clusters
We may spin up Timely processes to replace crashed ones while processes from a previous incarnation are still alive. In that case, the new processes will soon crash as well, and we will back off in the controller. We consider that acceptable for now. However, there may be some rare edge cases where we get into a permanently wedged state or a crash loop.
Currently, I believe that those edge cases are very unlikely to happen (e.g., only if there is a second unrelated crash during the initial bringup). However, we should still fix them.
Goals
Acceptance Criteria
After a Timely cluster crashes, assuming any computations that caused it to OOM or otherwise crash have been dropped, it will come back up in a reasonable amount of time with 100% probability.
Tasks
- Design a solution, and demonstrate its correctness with a detailed argument (not just trying a series of things to see if they work)
- Implement the solution
Testing
Bugs
Related
- #12946 is superficially connected: (1) a lot of people confuse the two issues, and (2) the two manifest in similar ways and have therefore been reported as the same issue in the past.
Questions
> However, there may be some rare edge cases where we get into a permanently wedged state or crash loop.
>
> Currently, I believe that those edge cases are very unlikely to happen (e.g., only if there is a second unrelated crash during the initial bringup). However, we should still fix them.
What are the edge cases? I know you said this is different from #12946, but I'm not clear on what those differences are!
This problem still exists. It's been preempted by another task for the moment.
I'm still unclear on what the edge cases are here that are different from #12946!
We aren't clear either, that's the problem. Someone needs to spend a few hours (or longer if necessary) thinking hard about this and either write a formal justification for the claim that clusters always converge to a good state (given enough time between failures), or identify the circumstances under which they don't.
Example of the general flavor of issue: https://github.com/MaterializeInc/materialize/issues/12800
IIRC there are also possible states where we hang indefinitely rather than crash looping. I'll describe one here if I can remember or find it.
I'd personally be inclined to optimistically close this and #12800. We haven't seen any failures of this sort since we killed linkerd, right? I suspect that we can rely on TCP connections failing much faster than Kubernetes's maximum backoff, and that should make this all work out.
Somewhat relatedly, I put up https://github.com/MaterializeInc/materialize/pull/14162 to remove the linkerd workaround from the timely networking establishment.
I don't think this can be closed yet. It is conceptually distinct from issues of the form "TCP connections can randomly go down". Even if we consider such issues to be out-of-scope and less likely due to having removed linkerd, I believe the class of issues I described in #12800 can happen due to issues that still are part of our threat model, and have nothing to do with random TCP flakiness, like a pod going down because the node it was on failed.
It's not urgent because it's rare enough that SREs can manually kill and reboot clusters if they're hanging or crash looping. But it is still an issue, unless I made a mistake in my concrete example of how a problem can arise that I described in #12800, or unless we disagree about which classes of failures are in scope.
We now have a demonstration that this is not rare! See #14301.
This seems like it could/should be fixable with a `Uuid` (or maybe a `SeqNo`) from the controller as part of replica initialization. Each instance should only form a mesh with other instances with the same identifier, rather than bonding to prior-generation instances. I think this was maybe less clear before replica rehydration was "drop replica; add replica", but it seems like it should be more straightforward now.
If this is wrong, or misses some important part of this issue, we should write down what I've missed, so that I don't keep thinking this would just solve it. :D
No, I think that works! I think you need worker processes to crash if, when making or receiving a connection, they realize the other end knows about a newer epoch. But otherwise it should work.
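A minimal sketch of what that handshake could look like, assuming a plain TCP exchange outside the real Timely/`computed` networking code; the names (`Epoch`, `exchange_epochs`, `HandshakeOutcome`) are illustrative, and a `SeqNo`-style `u64` stands in for the `Uuid`:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

/// Incarnation identifier handed to every process of a replica by the
/// controller (a sequence number here; a Uuid would work the same way).
type Epoch = u64;

#[derive(Debug)]
enum HandshakeOutcome {
    /// Peer is from the same incarnation: keep the connection.
    Accept(TcpStream),
    /// Peer is from an older incarnation: drop it and wait for a newer peer.
    RejectStale,
}

fn exchange_epochs(mut stream: TcpStream, my_epoch: Epoch) -> std::io::Result<HandshakeOutcome> {
    // Send our epoch, then read the peer's.
    stream.write_all(&my_epoch.to_be_bytes())?;
    let mut buf = [0u8; 8];
    stream.read_exact(&mut buf)?;
    let peer_epoch = Epoch::from_be_bytes(buf);

    if peer_epoch == my_epoch {
        Ok(HandshakeOutcome::Accept(stream))
    } else if peer_epoch > my_epoch {
        // The controller has already moved on to a newer incarnation, so this
        // process is stale; crash and let the orchestrator replace it.
        panic!("peer is at epoch {peer_epoch}, we are at {my_epoch}");
    } else {
        // The peer is a leftover from a previous incarnation; just drop it.
        Ok(HandshakeOutcome::RejectStale)
    }
}
```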
Initially the processes have no state, and thus it does not make sense to distinguish between various incarnations of them. For example, if process 1 is connected to process 2 and the latter goes down, there is no good reason for process 1 to conclude "my whole mesh is hosed, I need to panic!" It can just drop its connection to (the old incarnation of) process 2, wait for a new one to come up, and then connect to that one.
So I think the fundamental fix needed here is to not block on all the connections coming up before we notice that one of them has dropped. We should instead continually monitor all connections during startup, and if one goes down, drop it and let it be re-established (i.e., if it's one we initiated, initiate it again; if it's one we are waiting to accept from a peer, allow that peer to reconnect).
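To make the shape of that concrete, here is a rough sketch of the monitoring loop, not the actual Timely networking code; `connect_to_peer`, `is_alive`, and `maintain_mesh` are made-up names, the liveness check is only a placeholder, and it shows only the connections we initiate (ignoring the accepting side):

```rust
use std::collections::HashMap;
use std::io;
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

fn connect_to_peer(addr: &str) -> io::Result<TcpStream> {
    TcpStream::connect(addr)
}

fn is_alive(stream: &TcpStream) -> bool {
    // Placeholder: a real implementation would notice EOF or errors on the
    // stream itself rather than asking for the peer address.
    stream.peer_addr().is_ok()
}

fn maintain_mesh(peers: &[&str]) {
    let mut conns: HashMap<String, TcpStream> = HashMap::new();
    loop {
        // Drop any connection that has gone down so it can be re-made below.
        conns.retain(|_, s| is_alive(s));

        // (Re-)establish every missing connection; a failure is retried on
        // the next pass instead of wedging or aborting startup.
        for &addr in peers {
            if !conns.contains_key(addr) {
                if let Ok(s) = connect_to_peer(addr) {
                    conns.insert(addr.to_string(), s);
                }
            }
        }

        if conns.len() == peers.len() {
            // Fully meshed; a real implementation would hand the connections
            // off to the worker threads here while continuing to monitor.
        }
        thread::sleep(Duration::from_millis(500));
    }
}
```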
All issues that I know about are now fixed. Leaving this open pending QA approval.