Connection to MariaDB fails due to reaching connection limit
We occasionally have connection issues with our MariaDB instance. This is an intrinsic problem of a software system with multiple components: networks, databases, and software in general fail.
The latest episode was a crash while executing "_fire_queries" from blocking.py: MariaDB was rejecting connections because it was overloaded. soweego relies heavily on multithreading, and mix-n-match works on the same database, so maxing out the pool of connections can happen frequently.
It is undeniable that a pipeline taking weeks to compute its results cannot fail because of a temporary malfunction.
I propose a new design approach for our code, to be tried first in the blocking code cited above.
The main idea is a queue of queries shared among the threads. Each thread draws one query and runs it. Running a query has three possible outcomes:
- Success: everything goes on as designed;
- Failure: something went wrong because of a temporary issue, such as a connectivity problem;
- Fatal error: the query is somehow malformed and must be thrown away.
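The three outcomes could be modeled explicitly, so that the rest of the component only deals with them and never with raw driver exceptions. A minimal sketch follows; the `classify` helper and the exception types it catches are assumptions for illustration (a real implementation would catch the MariaDB driver's own errors, e.g. `OperationalError`):

```python
from enum import Enum, auto


class Outcome(Enum):
    SUCCESS = auto()  # everything went as designed
    FAILURE = auto()  # temporary issue (e.g. connectivity): retry later
    FATAL = auto()    # malformed query: throw it away


def classify(run_query, query):
    """Hypothetical helper: run a query and map what happened to an Outcome.

    `run_query` stands in for whatever function actually hits MariaDB;
    the exception types below are placeholders for the driver's real ones.
    """
    try:
        run_query(query)
        return Outcome.SUCCESS
    except ConnectionError:  # placeholder for e.g. mysql OperationalError
        return Outcome.FAILURE
    except ValueError:       # placeholder for a malformed-query error
        return Outcome.FATAL
```

Keeping this mapping in one place means the retry logic can stay completely driver-agnostic.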
After a failure, we re-enqueue the failed query and, after a timeout, dequeue the next one. The timeout should grow when failures are consecutive and reset to its default value when a success happens.
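The re-enqueue-with-growing-timeout policy above is essentially exponential backoff. A minimal sketch of one worker thread's loop, assuming a standard `queue.Queue` of queries and a `run_query` callable; the exception types and parameter names are placeholders, not the project's actual API:

```python
import queue
import time


def worker(q, run_query, base_timeout=1.0, max_timeout=60.0):
    """Drain `q`, retrying temporarily failed queries with exponential backoff.

    Hypothetical sketch: ConnectionError stands in for the driver's
    temporary errors, ValueError for a fatal malformed-query error.
    """
    timeout = base_timeout
    while True:
        try:
            query = q.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        try:
            run_query(query)
            timeout = base_timeout  # success: reset backoff to the default
        except ConnectionError:
            q.put(query)            # temporary failure: put the query back
            time.sleep(timeout)     # wait before drawing the next one
            timeout = min(timeout * 2, max_timeout)  # grow on consecutive failures
        except ValueError:
            pass                    # fatal: drop the malformed query
```

Several such workers can share the same queue, which is thread-safe; the backoff state stays per-thread, so one overloaded moment slows everyone down gracefully instead of crashing the run.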
It would be great to build a reusable component that acts as a black box for developers.
This is the overall design idea, but I'm sure the implementation will hold some interesting challenges.