soweego

Connection to MariaDB fails due to reaching connection limit

Open MaxFrax opened this issue 5 years ago • 0 comments

We sometimes have connection issues with our MariaDB instance. This is an intrinsic problem of any software system made of multiple components: networks, databases and software in general fail.

The latest connectivity issue was a crash while executing "_fire_queries" from blocking.py. MariaDB was rejecting connections because it was overloaded. Soweego relies heavily on multithreading, and mix-n-match works on the same database, so maxing out the pool of connections can happen frequently.

A pipeline that takes weeks to compute its results should not fail just because of a temporary malfunction.

I propose a new design approach in our code. It should be tried in the blocking code cited above.

The main idea is to have a queue of queries shared among the threads. Each thread draws one query and runs it. Running a query has three possible outcomes:

- Success: everything goes on as designed.
- Failure: something went wrong, but only because of a temporary issue. Connectivity issues belong to this case.
- Fatal error: the query is somehow malformed and needs to be thrown away.
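The shared queue and the three outcomes could be sketched roughly as follows. This is only an illustration of the idea, not existing soweego code: `Outcome`, `TransientError`, `classify`, `worker` and `run_all` are hypothetical names, and `execute` stands in for whatever function actually sends a query to MariaDB. It assumes that temporary problems can be recognized by a dedicated exception type, which the real database driver may or may not make easy.

```python
import enum
import queue
import threading


class Outcome(enum.Enum):
    SUCCESS = 'success'  # everything went as designed
    FAILURE = 'failure'  # temporary issue, e.g. connectivity
    FATAL = 'fatal'      # malformed query, to be thrown away


class TransientError(Exception):
    """Hypothetical marker for temporary problems such as refused connections."""


def classify(execute, query):
    """Run execute(query) and map what happens to one of the three outcomes."""
    try:
        execute(query)
    except TransientError:
        return Outcome.FAILURE
    except Exception:
        return Outcome.FATAL
    return Outcome.SUCCESS


def worker(queries, execute, outcomes, lock):
    """Draw queries from the shared queue until it is empty."""
    while True:
        try:
            query = queries.get_nowait()
        except queue.Empty:
            return
        outcome = classify(execute, query)
        with lock:  # protect the shared result list
            outcomes.append((query, outcome))


def run_all(query_list, execute, n_threads=4):
    """Share query_list among n_threads workers and collect the outcomes."""
    queries = queue.Queue()
    for query in query_list:
        queries.put(query)
    outcomes, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(queries, execute, outcomes, lock))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes
```

The sketch only records the outcome of each query; what to do with a failure is a separate decision.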

After a failure, we enqueue the failed query again. After a timeout, we can then dequeue the next query. This timeout should grow if the failures are consecutive, and reset to the default value when a success happens.
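A minimal sketch of the re-enqueue and growing timeout, under stated assumptions: `Backoff`, `TransientError` and `drain` are hypothetical names, `execute` stands in for the function that runs a query, and the doubling factor, the 1 second default and the 60 second cap are arbitrary choices that the proposal does not specify. The loop below is what a single worker thread would run.

```python
import queue
import time


class TransientError(Exception):
    """Hypothetical marker for temporary problems such as refused connections."""


class Backoff:
    """Timeout that grows on consecutive failures and resets on success."""

    def __init__(self, default=1.0, factor=2.0, maximum=60.0):
        self.default = default
        self.factor = factor
        self.maximum = maximum
        self.current = default

    def failure(self):
        """Return the timeout to wait now, then grow it for the next failure."""
        wait = self.current
        self.current = min(self.current * self.factor, self.maximum)
        return wait

    def success(self):
        """Reset the timeout to its default value."""
        self.current = self.default


def drain(queries, execute, backoff, sleep=time.sleep):
    """Run queries until the queue is empty; return the discarded ones."""
    discarded = []
    while True:
        try:
            query = queries.get_nowait()
        except queue.Empty:
            return discarded
        try:
            execute(query)
        except TransientError:
            queries.put(query)        # enqueue the failed query again
            sleep(backoff.failure())  # wait before dequeuing the next one
        except Exception:
            discarded.append(query)   # fatal error: throw the query away
        else:
            backoff.success()
```

Note that this sketch retries forever: if the database never recovers, the loop never ends. A real component would probably need a maximum number of attempts or some way to alert the operator.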

It would be great to build a reusable component that acts as a black box for developers.

This is the overall design idea, but I'm sure the implementation will hold some interesting challenges.

MaxFrax, Mar 22 '20 15:03