Add state and description codes: Part 2
This PR adds the handling of the state and description codes by the Middleware and Supervisor. This is based on the groundwork done in #300.
Requirements:
- The BitBoxBase should provide a global state code (IDLE, WORKING, WARNING, ERROR) to the user interfaces (BitBoxApp and HSM status LED)
- The BitBoxBase should provide a more detailed description code for each subsystem (Base, Middleware, Bitcoin Core, c-lightning, ...), if something happens that the user should know.
- In a first iteration only the description code with the highest priority should be displayed by the App and the HSM. The Supervisor handles the prioritization
- Both codes should be stored in Redis.
- The Supervisor should be the only one writing the codes to Redis when a subsystem state changes
- The Middleware will provide the codes to the HSM and the App
Definition


(taken from this private Google Doc)
Approach
The description codes are implemented using a Redis Sorted set. A Redis Sorted set has members in the form of [score] [value]. The set is sorted by the [score]. In the implementation the [score] is the priority of a subsystem state and the [value] is the code of a subsystem state. A subsystem state is active if it's a member of the sorted set. The state is inactive when it's not in the sorted set. The highest priority is at the top of the sorted set, the lowest priority at the bottom of the sorted set. The Middleware queries the highest priority subsystem state and relays it to the HSM and App.
Special cases:
- The description code with a code of 0 and a priority of 0 is always active. If it's the only member in the set, then there is no subsystem state that requires the users attention.
- If Redis is not reachable the Middleware sends description code 4 to the HSM and the App. This is meant as a fallback, but should never appear as an actual member in Redis.
Implementation
The Supervisor activates and deactivates description codes / subsystem states as soon as he learns that the state changes (currently by watching logs and Prometheus values). When a new state is set the Supervisor queries the highest priority description code (top element from the sorted set) and sets the state code accordingly. The Supervisor then sends notification via IPC to the Middleware that the subsystem state might have changed. The Middleware lets the HSM know about the new state.
For the subsystem states DOWNLOADING_UPDATE, UPDATE_FAILED, REBOOT and SHUTDOWN the Middleware logs distinct log-tags that let the Supervisors log-watcher know
that the subsystem state has changed. The INITIAL_BLOCK_SYNC subsystem state is set based on the Prometheus value bitcoin_ibd which is watched by the Prometheus-watcher of the Supervisor.
TODOs out of scope for this PR
- add a watcher and trigger for NO_NETWORK_CONNECTION to the Supervisor (@Stadicus started working on this in #301)
- add a watcher and trigger for OUT_OF_DISK_SPACE to the Supervisor (can probably be done via the existing fields in Prometheus, see https://github.com/shiftdevices/bitbox-base-internal/issues/230#issuecomment-531768649)
- add the state code and the description code to the GetBaseInfoResponse so they can be displayed in the App as well
- notify the App that the state/description codes might have changed
Added some needed logging to the Supervisor triggers. Reviewing should be easier now.
After updating a running BitBoxBase, the Supervisor logs the following over and over:
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 Setting bitcoin_ibd since no prior state exists.
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Initial set. Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
Both the Grafana dashboard and the Bitcoin scraper show bitcoin_ibd as 0
When bitcoind enters Initial Block Download mode, the state change seems to work fine:
Dec 07 15:13:09 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:09 Setting bitcoin_ibd since no prior state exists.
Dec 07 15:13:16 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:16 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd true' executed.
Dec 07 15:13:16 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:16 Activating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:13:16 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:16 Notifying the Middleware that a sub-system state changed
Switching back to IBD == false, the error logging continues, with a new entry every 10 seconds:
Dec 07 15:18:09 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:09 IBD finished: unsetting bitcoind_idb.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 IBD finished: unsetting bitcoind_idb.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 IBD finished: unsetting bitcoind_idb.
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
The following notes are out of scope for this first iteration, but @0xB10C's thoughts on how to implement would be interesting.
The subsystem states DOWNLOADING_UPDATE and UPDATE_FAILED are coming from the Middleware, but I think Mender should be a separate subsystem. Currently, these state changes are only recognized if triggered from the Middleware.
An update can also be triggered from the command line from a USB drive or the maintenance menu (e.g. for a factory reset). It would be better if the bbb-cmd.sh and the systemd-update-checks.sh would notify the Superivsor that an update has been started, or that it failed, e.g. through the standard systemd logs.
The same goes for REBOOT and SHUTDOWN. When powering down the device with the physical power button, or from the command line, the state should be set correctly, e.g. using a system unit that is executed on shutdown (and the HSM then showing "safe to unplug" or similar).
FYI, when I merge this PR on top of all other outstanding PRs, I get the following error message, so I guess it needs some minor adjustments after rebasing it:
src/middleware.go:271:21: middleware.hsmHeartbeat undefined (type *Middleware has no field or method hsmHeartbeat)
Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
This error usually means that there is no "idle" state in the sorted-set. Did you test on an recent (including the changes from #300) image? Otherwise you'd have to add the the Redis keys manually. See https://github.com/digitalbitbox/bitbox-base/commit/71c173ffd24f2a2b3a86b3b55703ee0d2d2e9b1a.
An update can also be triggered from the command line from a USB drive or the maintenance menu (e.g. for a factory reset). It would be better if the
bbb-cmd.shand thesystemd-update-checks.shwould notify the Superivsor that an update has been started, or that it failed, e.g. through the standard systemd logs.The same goes for
REBOOTandSHUTDOWN. When powering down the device with the physical power button, or from the command line, the state should be set correctly, e.g. using a system unit that is executed on shutdown (and the HSM then showing "safe to unplug" or similar)
Having the Supervisor watch a systemd log with journald doesnt seem to be a good fit for this. Using a named pipe to notify the Supervisor on triggers should work. The ipcnotificaton package in the Middleware could be used for a Supervisor specific pipe. The shell scripts could emit a notification, which would trigger the Supervisor.
Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
This error usually means that there is no "idle" state in the sorted-set. Did you test on an recent (including the changes from #300) image? Otherwise you'd have to add the the Redis keys manually. See 71c173f.
Seems the key base:descriptionCode is not added during the build process, despite being present in the factorysettings.txt.
$ redis-cli ZREVRANGE base:descriptionCode 0 0
(empty list or set)
After adding the key/value manually with ZADD base:descriptionCode 0 0, this error is no longer logged. I will take a look at the build process to figure this out.