Implement the Strategy Pattern for the SubstrateNodeMonitor
Technical Story
When you have a multiprocessing system you have to watch out how many processes you are going to spawn. There are two reasons why you need to do this:
- If the number of processes becomes a lot bigger than the number of CPUs then the Operating System will spend most of its time context switching between processes which is time consuming
- Processes take a lot of memory, therefore if you spawn a lot of processes you may eventually run out of memory
If we only focus on the Monitors, currently we are creating a manager process for every type of monitorable and a monitor process for every monitorable. For example, suppose that the user added 4 cosmos nodes and 1 DokerHub repo for monitoring. On startup PANIC is going to start a ContractsMonitorsManager, NetworkMonitorsManager, NodeMonitorsManager, SystemMonitorsManager, DockerHubMonitorsManager and a GitHubMonitorsManager all in a separate process. In addition to this, PANIC will start 4 CosmosNodeMonitors and 1 DockerHubMonitor in a separate process. As a result we are creating a lot of processes which will portentially increase as the node operator adds more monitorables. At a larger scale we might end up having a slow system and/or run out of memory.
To solve this it is being proposed that we start reducing the number of processes by using a combination of processes and threads. We can start by first focusing on the Monitors, benchmark the implementation and if there is benefit we would incorporate these changes to other components. The idea is to have a single MonitorsManager which spawns a thread for each monitorable. As per the resources below, threads are more memory efficient and lightweight to handle. When implementing the threaded monitor we have two options:
- To implement a long-lived thread which connects to rabbit once and performs work every 10 seconds.
- To implement a thread whose work is to connect to rabbit, do work, disconnects from rabbit and terminates.
It is suggested that we perform implementation 1 because according to the RabbitMQ docs the rabbit server works better with long-lived connections
For this huge task to be completed we need to tackle the following:
- Implement the Strategy Pattern for every type of monitor to have a code-base of higher quality.
- Implement a single
MonitorsManagerthat is able to receive configurations and use the appropriate strategy to start a monitor in a separate thread based on the routing key - Add the monitorables store functionality that we already have in present managers.
- Add the heartbeats functionality that we already have in present managers.
Therefore to easily handle this large change we will break the task described above into granular tickets.
The aim of this ticket is to develop the Strategy Pattern requirments for the SubstrateNodeMonitor
Resources:
- https://www.indeed.com/career-advice/career-development/multithreading-vs-multiprocessing
- https://timber.io/blog/multiprocessing-vs-multithreading-in-python-what-you-need-to-know/
- https://towardsdatascience.com/multithreading-multiprocessing-python-180d0975ab29
- https://refactoring.guru/design-patterns/strategy/python/example
- https://auth0.com/blog/strategy-design-pattern-in-python/
Requirements
The aim of the Strategy Pattern is to make it easy to encapsulate the tasks needed to be done by the MonitorsManager in order to start the SubstrateNodeMonitor, namely:
- Parsing the
SubtrateNodeConfig - Starting the
SubstrateNodeMonitorusing threads with the appropriate fields.
Some code snippets that might be useful:
class MonitorStarter:
"""
This is the Context class, it defines the interface of interest to clients.
"""
def __init__(self, strategy: MonitorStarterStrategy) -> None:
"""
Usually, the Context accepts a strategy through the constructor, but
also provides a setter to change it at runtime.
"""
self._strategy = strategy
@property
def strategy(self) -> MonitorStarterStrategy:
"""
The Context maintains a reference to one of the Strategy objects. The
Context does not know the concrete class of a strategy. It should work
with all strategies via the Strategy interface.
"""
return self._strategy
@strategy.setter
def strategy(self, strategy: MonitorStarterStrategy) -> None:
"""
Usually, the Context allows replacing a Strategy object at runtime.
"""
self._strategy = strategy
def parse_and_start(self, config: ConfigDict) -> None:
"""
The Context delegates some work to the Strategy object instead of
implementing multiple versions of the algorithm on its own.
"""
parsed_config = self._strategy.parse(config)
self._strategy.start(parsed_config)
class MonitorStarterStrategy(ABC):
"""
The Strategy interface declares operations common to all supported versions
of some algorithm.
The Context uses this interface to call the algorithm defined by Concrete
Strategies.
"""
@abstractmethod
def parse(self, config: ConfigDict) -> Config:
pass
@abstractmethod
def start(self, config: Config) -> None:
pass
class SubstrateNodeMonitorStarterStrategy(Strategy):
def parse(self, config: SubstrateNodeConfigDict) -> SubstrateNodeConfig:
return SubstrateNodeConfig(config['node_id'], config['ws_url'])
def start(self, config: SubstrateNodeConfig) -> None:
redis = Redis('localhost')
monitor = SubstrateNodeMonitor(config, redis)
thread.start(monitor.start())
Some Notes:
- Objects such as data sources may be shared between threads, this means that although the bulk of the
SubstrateNodeMonitorimplementation should not be effected we might require a lock to access shared objects such as the list ofSubstrateNodesto be used as data sources. This needs further investigation, however, if we create new objects from dictionaries for a particular thread we might not need locks because each object would be unique. However, with this approach whenever there is a config update/removal we need to restart a long-lived thread. - Since we are able to use shared memory we can use a single Monitors logger for both the manager and the individual monitors running in separate threads. This will help us to reduce the number of log files that are currently being generated by PANIC. However we must make sure that we are able to do so in a safe manner. According to this thread it seems that the logging module is thread safe https://stackoverflow.com/questions/2973900/is-pythons-logging-module-thread-safe
- Each thread must have its own Rabbit connection as Pika is not thread safe
- Each thread must be set as daemon just in case the parent process needs to exit.
- Each monitor must handle the thread stopping criteria possibly using exit flags or exceptions.
Acceptance criteria
Scenario: The SubstrateNodeMonitor starting procedure should satisfy the Strategy pattern
Given: The SubstrateNodeMonitor accesses shared memory then it is able to access shared resources without any Race Condition
Then: It can do so without any race conditions / errors
@simplyrider Also suggested another approach for implementing the Monitors architecture:
To make sure that the system never runs out of memory and CPU processing power is kept to a low we must make sure that as the user adds more monitorables there aren't a lot of threads/processes running at the same time. A good approach to manage this is to implement a queue which manages how many threads execute at the same time. Therefore we can have the following:
- We would have 1
MonitorsManagerthat has 1 thread listening for all type of monitorable configs and the other thread executing a batch of tasks from a multiprocessing queue every X seconds (X should vary according to how many monitorables we have). - When a configuration is received, the
MonitorManageradds a task on a multiprocessing queue for each configuration. In this task we must specify the monitor strategy to execute, and the corresponding configuration. - Once 5 seconds elapse for the task thread, the task thread grabs Y tasks from the queue(Y should very according to how many monitorables we have), checks that their configurations were not updated by the user and if not it starts a monitor thread for that configuration. Afterwards it puts the same task to the end of the queue for another monitoring round later on.
- The monitor process needs to connect to rabbit with a separate connection, retrieve data, send it and disconnect from rabbit.
Some Notes:
- It is best to stress test multiple approaches to see which once performs best. We need to test the multiprocessing approach (The one we have currently), multithreading approach, and the queue approach.
- Another important thing we must need to think about is the heartbeat mechanism. The queuing approach may require us to re-design the heartbeat mechanism, modify it for the
MonitorsManageror remove it altogether if deemed not useful.
Some resources:
- https://medium.com/omarelgabrys-blog/threads-vs-queues-a71e8dc30156
- https://testdriven.io/blog/developing-an-asynchronous-task-queue-in-python/
- http://pymotw.com/2/Queue/
- https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html#//apple_ref/doc/uid/TP40008091-CH102-SW1