Graceful Shutdown Spike
Currently, when a Node is ordered to shut down, it simply kills its own process. No file handles or TCP/database/WebSockets connections are closed, nothing is gently deallocated, nothing is cleaned up in the database, and no Gossip is sent.
Node shutdown should be much more graceful than this. Pending operations should be completed, handles and connections should be closed, neighbors should be notified, and so on.
Create a set of cards detailing all the various operations that need to be performed in the process of a Graceful Shutdown. Note that MASQ-Project/Node#404 and MASQ-Project/Node#407 already exist; MASQ-Project/Node#410 should also be included.
One of the things to be considered is the logger: what if the Node wants to shut down just as the logger is rotating the logfiles in a background thread? The logger library provides hooks for this, but the hooks will need to be used.
Things to consider:
- Logger: If the logger is rotating files at shutdown, finish the rotation before shutting down
- MASQ-Project/Node#404 UIs: When the Node shuts down, all UIs connected to the Node or the Daemon should be informed
- MASQ-Project/Node#407 VACUUM: Maybe do database compaction on Graceful Shutdown. Maybe not, though; perhaps it ought to be on startup. At any rate, it should be at a time when database traffic has been cut off or has not yet started.
- MASQ-Project/Node#410 UiShutdownResponse: Kind of subsumed in MASQ-Project/Node#404; make sure UIs know when shutdown is complete. This might be best handled by the Daemon. When the UiGateway kills the Node, the Daemon and all the Node-connected UIs should sense it, and the UIs should back off to the Daemon.
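The logger concern above (don't kill the process mid-rotation) boils down to mutual exclusion between the background rotation thread and the shutdown path. Here is a minimal language-agnostic sketch in Python (the real Node would use the logger library's own hooks; `RotationGuard` and both method names are hypothetical): the rotation thread holds a lock for the whole rotation, and shutdown acquires that same lock, so it blocks until any in-progress rotation completes.

```python
import threading
import time

class RotationGuard:
    """Hypothetical sketch: serialize logfile rotation against shutdown."""
    def __init__(self):
        self._lock = threading.Lock()
        self.rotations = 0

    def rotate(self):
        # The rotation thread holds the lock for the whole rotation.
        with self._lock:
            time.sleep(0.05)  # stand-in for renaming/compressing logfiles
            self.rotations += 1

    def wait_for_quiescence(self):
        # Shutdown path: blocks until no rotation is in progress.
        with self._lock:
            pass

guard = RotationGuard()
rotation_thread = threading.Thread(target=guard.rotate)
rotation_thread.start()
time.sleep(0.01)              # let the rotation begin
guard.wait_for_quiescence()   # returns only after the rotation finishes
rotation_thread.join()
print(guard.rotations)        # 1
```

Whether the real logger library exposes its rotation thread this way is exactly what the spike needs to find out; the point is only that shutdown must synchronize with rotation, not race it.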
Graceful Shutdown on panic?
- PRO:
- There might be (might be) less chance of a panic destroying a database.
- CON:
- It might get pretty complex to gracefully shut down a sick Node, especially if you pop another panic in the process, without just as much chance of destroying the database as with a raw panic. We won't get it for free.
- Second thought: actually, it might not be any different than standard Graceful Shutdown with one recalcitrant Actor. We might get it almost for free.
- Preliminary conclusion: Put it on a spike card and let it be prioritized. Maybe it'll languish.
Paying the Bills: Paying the outstanding bills before a shutdown is important, because small debts become more damaging as they age, and once we're stopped we're going to be allowing them to age indefinitely without payment.
Therefore, once it has been decided that we're shutting down, we should refuse further requests from the browser, signal our neighbors that we will not be routing any more data (possibly by closing TCP connections, but see below), and complete any business that is currently in progress, including any blockchain scans; then we should conduct at least one final Payables scan and pay the outstanding bills before actually shutting down.
However, there is a complicating factor: if we're using clandestine blockchain operations, we'll need to use the MASQ Network, and run up some new debts, in order to conduct that final Payables scan. In order to handle this, we might need some special-case code where we look at the route over which we intend to do the Payables scan and include an extra bonus in the Payables list for each of the Nodes in that route.
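The special-case bonus described above could look something like this sketch (all names and amounts are hypothetical, and the real accounting lives in the Accountant's database, not a dict): every Node on the route chosen for the final Payables scan gets an extra amount added to its payable, so the debts run up by the scan itself are prepaid.

```python
from collections import defaultdict

def add_route_bonus(payables, route, bonus_per_hop):
    """Sketch: 'payables' maps node public keys to amounts owed; every Node
    on the route of the final Payables scan gets an extra bonus so the debts
    incurred by the scan itself are covered in advance."""
    adjusted = defaultdict(int, payables)
    for node_key in route:
        adjusted[node_key] += bonus_per_hop
    return dict(adjusted)

payables = {"AQID": 12_000, "BAUG": 7_500}
route = ["AQID", "CCDD"]           # Nodes that will relay the final scan
print(add_route_bonus(payables, route, 1_000))
# {'AQID': 13000, 'BAUG': 7500, 'CCDD': 1000}
```

Note that a route Node we don't yet owe anything (CCDD above) acquires a fresh payable, which is the point: the final scan must pay for its own delivery.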
Shutdown Order: If we wait to shut down until all business is complete and the Actors are all idle, the order in which we shut down the Actors may become less important. However, if it does end up being important, here are some ways we've considered to get it done.
Graceful Shutdown schemes:
- Each Actor knows only which other Actors depend on it
- The UiGateway broadcasts "I'm shutting down" WebSockets messages to all its clients and shuts down the WebsocketSupervisor so that no more UI messages will bother it
- UiGateway broadcasts "Everybody shut down!" message and starts a clock
- Each Actor performs whatever pre-shutdown operations are appropriate to it
- Each Actor waits to receive an "Okay, I'm shut down" message from every Actor that depends on it
- The Actor performs its own shutdown operations
- The Actor broadcasts an "Okay, I'm shut down" message to every other Actor that's still alive, including the UiGateway
- The Actor actually stops itself
- The UiGateway waits until it receives an "Okay, I'm shut down" message from every other Actor, or until the clock expires
- If the clock expires before all the okay-I'm-shut-down messages arrive, the UiGateway forcibly terminates all the Actors that are left (or, in a somewhat less brutal version, broadcasts an "Everybody shut down, I really mean it!" to every Actor that's left and starts another clock; when Actors get that message, they stop waiting for dependents and shut down immediately; only if the second message doesn't shut everybody down before the clock expires do the forcible shutdowns come out)
- UiGateway terminates Automap, killing the housekeeping thread and, ideally, deleting its port mapping
- UiGateway handles logger shutdown, being careful to allow any in-progress file rollover to complete
- UiGateway handles database shutdown (VACUUM?)
- UiGateway handles any other relevant shutdowns (list?)
- UiGateway terminates the Node process
- Benefits:
- The UiGateway doesn't have to know anything about Actor shutdown order; it just has to be able to broadcast a shutdown order to all Actors.
- Potential problems:
- Actors have to know which other Actors they need to transmit messages to; they shouldn't have to know anything about the Actors from which they receive messages. However, this scheme would require that.
- There are a few quasi-circular dependencies among the Actors. For example, when the Accountant finds out it's going to be shutting down, it will probably need to tell the BlockchainBridge to start a Payables scan. But when the Payables scan finishes, the BlockchainBridge will need the Accountant to make the proper modifications to the database. So the Accountant depends on the BlockchainBridge to shut down, and the BlockchainBridge depends on the Accountant to shut down. The dependency isn't actually circular, because it doesn't repeat. A possible solution might be to have the Accountant remember that it had been ordered to stop once already, then start the Payables scan. When the Payables scan finished, the BlockchainBridge could send the results message to the Accountant immediately followed by another "Everybody shut down!" message that looks to the Accountant as though it came from the UiGateway. The Accountant can then notice that this is its second shutdown order and really shut down after sending an "Okay, I'm shut down" message to the UiGateway.
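The remember-the-first-order trick described above can be sketched as a small state machine. This is an illustration only, with hypothetical message names; the real Actors exchange Actix messages, not strings.

```python
class Accountant:
    """Sketch of the double-shutdown-order trick: the first shutdown order
    triggers the final Payables scan; the second one, forwarded by the
    BlockchainBridge right after the scan results, actually stops the Actor."""
    def __init__(self):
        self.shutdown_orders = 0
        self.outbox = []  # stand-in for messages sent to other Actors

    def handle_shutdown_order(self):
        self.shutdown_orders += 1
        if self.shutdown_orders == 1:
            # First order: kick off the final Payables scan instead of dying.
            self.outbox.append("scan_payables -> BlockchainBridge")
        else:
            # Second order: the scan is done, so really shut down.
            self.outbox.append("okay_im_shut_down -> UiGateway")

    def handle_scan_results(self, results):
        self.outbox.append(f"record {len(results)} payment(s) in database")

acc = Accountant()
acc.handle_shutdown_order()            # from UiGateway
acc.handle_scan_results(["payment"])   # from BlockchainBridge
acc.handle_shutdown_order()            # second order, forged by BlockchainBridge
print(acc.outbox[-1])                  # okay_im_shut_down -> UiGateway
```

The appeal of this shape is that the Accountant needs no special knowledge of where the second order came from; it only counts orders.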
- Each Actor is assigned to a shutdown level; all Actors on one level must shut down before any Actors in the next level
- Benefits:
- The UiGateway will have more sense of and control over the shutdown process. If one Actor goes crazy and refuses to shut down and has to be killed, the Actors it depends on can still be shut down gracefully, rather than hanging waiting for an "Okay, I'm shut down" message from a crazy dependent.
- Potential problems:
- The UiGateway has to know things about the Actors (their shutdown sequence) that it shouldn't need to know.
- There are a few quasi-circular dependencies among the Actors. For example, when the Accountant finds out it's going to be shutting down, it will probably need to tell the BlockchainBridge to start a Payables scan. But when the Payables scan finishes, the BlockchainBridge will need the Accountant to make the proper modifications to the database. So the Accountant depends on the BlockchainBridge to shut down, and the BlockchainBridge depends on the Accountant to shut down. The dependency isn't actually circular, because it doesn't repeat; but figuring out which levels to put the two actors on is still unclear. A scheme where the Accountant is listed on two levels, with the BlockchainBridge on a level between them, and where the Accountant lies to the UiGateway about shutting down the first time, could solve this but would be hacky.
- UiGateway knows the precise sequential order in which the Actors must shut down. This is a special case of the shutdown-level solution, where each Actor has its own private shutdown level.
Actors (each one might get at least one card to implement Graceful Shutdown):
- Accountant
- Shutdown command from UiGateway
- If (one or more scans in progress) || (no okay-I'm-done message from ProxyServer yet) || (no okay-I'm-done message from ProxyClient yet):
  - Schedule an identical (or the same) shutdown command for a second (or so) in the future
  - Return
- Start a new Payables scan
- Okay-I'm-done message from BlockchainBridge
- Send okay-I'm-done message to Neighborhood
- Send okay-I'm-done message to UiGateway
- Stop
- BlockchainBridge
- Shutdown command from UiGateway
- Set a flag: next Payables scan will be the final one
- Payables-scan command from Accountant
- Perform a normal scan
- If flag is set:
  1. Send okay-I'm-done message to Accountant
  2. Send okay-I'm-done message to ProxyServer
  3. Send okay-I'm-done message to UiGateway
  4. Stop
- Dispatcher
- Shutdown command from UiGateway
- ignore
- Okay-I'm-done message from ProxyServer
- Send okay-I'm-done message to UiGateway
- Stop
- Hopper
- Shutdown command from UiGateway
- ignore
- Okay-I'm-done messages received from all of Neighborhood, ProxyClient, and ProxyServer
- Send okay-I'm-done message to Accountant
- Send okay-I'm-done message to UiGateway
- Stop
- Neighborhood
- Shutdown command from UiGateway
- Change to Consume-Only mode and Gossip the change
- Update past neighbors in PersistentConfiguration
- Send okay-I'm-done message to Hopper
- Okay-I'm-done message from Accountant
- Send okay-I'm-done message to UiGateway
- Stop
- Configurator
- Shutdown command from UiGateway
- Send okay-I'm-done message to UiGateway
- Stop
- ProxyClient
- Shutdown command from UiGateway
- Send last_data CORES packages to all StreamKeys in .stream_contexts
- Send okay-I'm-done message to Accountant
- Send okay-I'm-done message to Hopper
- Shut down all TCP streams in the StreamHandlerPool
- Send okay-I'm-done message to UiGateway
- Stop
- ProxyServer
- Shutdown command from UiGateway
- ignore
- Okay-I'm-done message from BlockchainBridge (done with final Payables scan)
- Send last_data CORES packages to all StreamKeys in .stream_key_routes
- Send okay-I'm-done message to StreamHandlerPool
- Send okay-I'm-done message to Accountant
- Send okay-I'm-done message to Dispatcher
- Send okay-I'm-done message to Hopper
- Send okay-I'm-done message to UiGateway
- Stop
- StreamHandlerPool (ProxyServer)
- Shutdown command from UiGateway
- ignore
- Okay-I'm-done message from ProxyServer
- Shut down all TCP streams
- Send okay-I'm-done message to UiGateway
- Stop
- UiGateway
- Special handling; see above.
Note: There's a commonality above that goes like this:
- Receive shutdown command from UiGateway
- Do something Actor-specific; that is, the kind of thing that could be specified in a closure or virtual method
- Wait until okay-I'm-done messages have been received from an Actor-specific list of Actors
- Do something else Actor-specific; that is, the kind of thing that could be specified in a closure or virtual method
- Send an okay-I'm-done message to the UiGateway
- Stop

If this functionality could be built into an object, most of the Actors could have a properly-parameterized instance of that object, and it could handle most of their Graceful-Shutdown responsibilities.
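A minimal sketch of what such a parameterized object might look like, in Python for brevity (the real thing would be a Rust struct embedded in each Actor; `ShutdownScaffold`, the hook names, and the message strings are all hypothetical):

```python
class ShutdownScaffold:
    """Sketch: pre/post hooks are the Actor-specific closures; 'await_from'
    is the Actor-specific list of Actors whose okay-I'm-done messages must
    arrive before the final cleanup runs."""
    def __init__(self, name, await_from, pre_hook=None, post_hook=None):
        self.name = name
        self.pending = set(await_from)
        self.pre_hook = pre_hook or (lambda: None)
        self.post_hook = post_hook or (lambda: None)
        self.ordered = False   # has the shutdown command arrived yet?
        self.stopped = False
        self.sent = []         # stand-in for outgoing messages

    def on_shutdown_command(self):
        self.ordered = True
        self.pre_hook()        # Actor-specific pre-shutdown work
        self._maybe_finish()

    def on_okay_im_done(self, sender):
        self.pending.discard(sender)
        self._maybe_finish()

    def _maybe_finish(self):
        if self.ordered and not self.pending and not self.stopped:
            self.post_hook()   # Actor-specific final cleanup
            self.sent.append(("okay_im_done", "UiGateway"))
            self.stopped = True

events = []
hopper = ShutdownScaffold(
    "Hopper",
    await_from=["Neighborhood", "ProxyClient", "ProxyServer"],
    post_hook=lambda: events.append("Hopper cleanup"))
hopper.on_shutdown_command()            # dependents still alive: keep waiting
hopper.on_okay_im_done("Neighborhood")
hopper.on_okay_im_done("ProxyClient")
hopper.on_okay_im_done("ProxyServer")   # last dependent: cleanup, report, stop
print(hopper.stopped, events)           # True ['Hopper cleanup']
```

The per-Actor variations above (e.g. Hopper also notifying the Accountant) would go in the post hook, leaving the wait-and-report machinery shared.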
A : B, C means "A can't shut down until after B and C have shut down"; the ------ lines separate shutdown levels.
BlockchainBridge :
Configurator :
ProxyClient :
------
ProxyServer : BlockchainBridge
------
StreamHandlerPool : ProxyServer
Dispatcher : ProxyServer
Accountant1 : ProxyServer, ProxyClient
Accountant2 : BlockchainBridge
------
Neighborhood : Accountant
------
Hopper : Neighborhood, ProxyClient, ProxyServer
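The grouping above can also be derived mechanically from the dependency table. A sketch (Python for brevity; the Accountant1/Accountant2 entries are collapsed into a single Accountant that must wait for all three of its blockers): a Kahn-style pass repeatedly peels off every Actor whose blockers have already shut down, producing one level per pass.

```python
def shutdown_levels(blockers):
    """'blockers[a]' lists the Actors that must shut down before 'a'
    (the "A : B, C" notation).  Returns Actors grouped into levels; every
    level must fully shut down before the next begins."""
    remaining = dict(blockers)
    levels = []
    while remaining:
        ready = sorted(a for a, pre in remaining.items()
                       if all(p not in remaining for p in pre))
        if not ready:
            raise ValueError("circular shutdown dependency")
        levels.append(ready)
        for actor in ready:
            del remaining[actor]
    return levels

blockers = {
    "BlockchainBridge": [],
    "Configurator": [],
    "ProxyClient": [],
    "ProxyServer": ["BlockchainBridge"],
    "StreamHandlerPool": ["ProxyServer"],
    "Dispatcher": ["ProxyServer"],
    # Accountant1 and Accountant2 collapsed into one entry
    "Accountant": ["ProxyServer", "ProxyClient", "BlockchainBridge"],
    "Neighborhood": ["Accountant"],
    "Hopper": ["Neighborhood", "ProxyClient", "ProxyServer"],
}
for level in shutdown_levels(blockers):
    print(level)
# ['BlockchainBridge', 'Configurator', 'ProxyClient']
# ['ProxyServer']
# ['Accountant', 'Dispatcher', 'StreamHandlerPool']
# ['Neighborhood']
# ['Hopper']
```

This reproduces the five levels in the table, which suggests the level assignments don't need to be hand-maintained if the dependency map is kept accurate.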
Once Graceful Shutdown exists, we can complete what was designed for the financials statistics but hasn't been possible until now.
At the moment we can't retain long-term statistics: every Node termination loses the running totals we've kept.
The idea, of course, is to create database rows that preserve those totals so they can be picked up again when the Node comes back up. Technically this has never been hard; we've been in similar situations many times. But the final database write has to come at the right time, just as the Node is doing its last cleanup.
(There is a question whether we should save the cached totals periodically rather than only on this single event; that would better protect the statistics from losses caused by panics.)
The Accountant is the Actor that needs to do all this. It already does the caching, but today the cache is forgotten and begun anew with each new session of the Node.
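A sketch of the persistence step (the table name, column names, and total names are hypothetical; the Node's real schema and DAO layer would differ): the cached totals are upserted as the Accountant's last database write during Graceful Shutdown, and read back on the next startup.

```python
import sqlite3

def save_totals(conn, totals):
    """Hypothetical sketch: persist the Accountant's cached totals as the
    final database write of a Graceful Shutdown."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS totals (name TEXT PRIMARY KEY, value INTEGER)")
    conn.executemany(
        "INSERT INTO totals (name, value) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET value = excluded.value",
        totals.items())
    conn.commit()

def load_totals(conn):
    """Next session's startup: resume the totals, or start fresh."""
    try:
        return dict(conn.execute("SELECT name, value FROM totals"))
    except sqlite3.OperationalError:   # first session ever: no table yet
        return {}

conn = sqlite3.connect(":memory:")
cached = {"routed_bytes": 1_234, "earned_gwei": 567}
save_totals(conn, cached)        # last write before the process exits
print(load_totals(conn))         # {'routed_bytes': 1234, 'earned_gwei': 567}
```

If the periodic-save variant mentioned above were adopted, `save_totals` could simply be called on a timer as well as at shutdown; the upsert makes the write idempotent either way.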