Non-atomic DB operations: payables
Database emergency management is never easy. We must admit, though, that we haven't done enough to protect the Node from potentially harmful, partially executed DB queries left behind when the application crashes. We do our best to identify places where panics are possible and guard against them by various measures, but even if we managed to prevent all of them, we still wouldn't be in the clear: we have no control over the host running the Node. We must therefore always allow for a crash with such bad timing that it leaves the persistent database in a state that will confuse the Node once it recovers and may cause undefined behavior.
There are some places in our codebase where we have already made this kind of effort, but only to some extent. There are still many other places where we remain at risk.
The ultimate solution to these threats is to use strictly atomic operations, so that a critical sequence either executes as one unit or not at all. That is achieved by wrapping the sequence in a single DB transaction. This is highly desirable but, to be fair, it increases the difficulty of writing such code.
Our standard DAO methods are only quasi-atomic. For example, we do strive to execute updates of the same sort over multiple rows in a single query, instead of treating each record separately and therefore in a separate SQLite transaction. That's good, but not enough.
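To illustrate the single-query approach, here is a minimal sketch using Python's stdlib sqlite3 module (chosen only so the example is self-contained; the Node itself is Rust). The table and column names are hypothetical, not taken from the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payable (wallet TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO payable VALUES (?, ?)",
                 [("0xaaaa", 100), ("0xbbbb", 200), ("0xcccc", 300)])
conn.commit()

# One statement adjusts every affected row; SQLite applies a single statement
# atomically, so a crash can never leave only some of these rows updated.
wallets = ["0xaaaa", "0xcccc"]
placeholders = ",".join("?" for _ in wallets)
conn.execute(
    f"UPDATE payable SET balance = balance - 50 WHERE wallet IN ({placeholders})",
    wallets,
)
conn.commit()

print(sorted(conn.execute("SELECT wallet, balance FROM payable").fetchall()))
# → [('0xaaaa', 50), ('0xbbbb', 200), ('0xcccc', 250)]
```

The per-row alternative (one UPDATE and one commit per wallet) would open a window after each commit where a crash leaves the batch half-applied.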
Unfortunately, we also have to deal with combinations of DB operations that refer to different DB tables, which implies that different DAOs are involved. This is why it's impossible to express all the operations in a single query belonging to just one DAO method.
Instead, what needs to be done is to begin the sequence with a method in one DAO, not commit that transaction yet, but carry it out of the method and hand it to another DAO whose method is prepared to accept an open transaction as an argument. Inside this other method, one or more further queries are executed; this time the changes bear on a different table, which is the crucial difference. Only when we know we have reached the end of the critical sequence, meaning we could afford to lose the entire state of our application right afterwards without harmful effects, are we allowed to commit the transaction. This commit modifies the state of multiple tables at once, protected by the atomicity built into the database manager.
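The pattern above can be sketched as follows, again in self-contained Python/sqlite3 rather than the Node's Rust (where the open transaction would be a rusqlite `Transaction` passed by reference). The DAO function names and schema here are illustrative only:

```python
import sqlite3

def update_payable(txn, wallet, amount):
    # First "DAO": adjusts the payable balance inside the caller's open
    # transaction; it deliberately does NOT commit.
    txn.execute("UPDATE payable SET balance = balance - ? WHERE wallet = ?",
                (amount, wallet))

def mark_sent_payable_processed(txn, tx_hash):
    # Second "DAO": touches a different table, reusing the same open
    # transaction it received as an argument.
    txn.execute("UPDATE sent_payable SET processed = 1 WHERE tx_hash = ?",
                (tx_hash,))

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE payable (wallet TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE sent_payable (tx_hash TEXT PRIMARY KEY, processed INTEGER)")
conn.execute("INSERT INTO payable VALUES ('0xaaaa', 100)")
conn.execute("INSERT INTO sent_payable VALUES ('0x1234', 0)")

conn.execute("BEGIN")                        # open the single shared transaction
update_payable(conn, "0xaaaa", 40)           # DAO one writes, no commit yet
mark_sent_payable_processed(conn, "0x1234")  # DAO two writes into another table
conn.execute("COMMIT")                       # both changes become visible at once
```

A crash anywhere between BEGIN and COMMIT rolls both writes back, so the two tables can never disagree about whether the payment was accounted for.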
(We've already been through this successfully. As of now, the 15th of June, there is one usage of this technique already: performing the accountancy for received payments together with updating the scan start-block at the end, while never allowing these two parts to be separated. This is an important fact, because it means we are also equipped with powerful tools that can be used in testing. Not so long ago this was thought of as unachievable. Now, however, we even have a mockable transaction (not the same as the standard concept) which was designed exactly for the case described above. This should greatly help us address these issues, including proper test coverage.)
Examples of vulnerable operations in our code:
-
Confirming a payable transaction in confirm_transactions() in node/src/accountant/scanners/mod.rs. We should update the payable record and mark the sent payable record as processed (which, in the old design, meant deleting the pending payable fingerprint) in one step.
-
The new retry mechanism exposes a small unprotected window at its core, though it is guarded by additional checks on top. It concerns the PendingTooLong Tx failures, which can theoretically turn into a successful transaction just after we give them that status. We double-check in the next interval, and if we find this was the case, we can still rearrange the DB so that we take in the old, successful transaction and eliminate the newer one. However, this too requires changes in more than a single DB table. The plan is to first run code updating the old sent_payable record to be marked as a success, followed by deleting the failure record from the failed_payable table. We'd be better off doing these operations in a single transaction: we don't want to recover from a crash, recheck the failure record, find out the transaction was actually a success, and have to do it all over. What's more, this should properly be done atomically together with the writes into the payable table, where the balance and timestamp need to be updated on the corresponding record.
-
Think of more cases...
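The retry scenario in the second bullet above can be sketched as one three-table transaction. As before, this is self-contained Python/sqlite3 standing in for the Rust/rusqlite code, and the table and column names (sent_payable, failed_payable, payable, status, last_paid) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.executescript("""
    CREATE TABLE sent_payable (tx_hash TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE failed_payable (tx_hash TEXT PRIMARY KEY);
    CREATE TABLE payable (wallet TEXT PRIMARY KEY, balance INTEGER, last_paid INTEGER);
    INSERT INTO sent_payable VALUES ('0xold', 'PendingTooLong');
    INSERT INTO failed_payable VALUES ('0xold');
    INSERT INTO payable VALUES ('0xaaaa', 500, 0);
""")

def resolve_late_success(conn, tx_hash, wallet, amount, now):
    # All three tables change inside one transaction: either every write
    # lands, or (after a crash before COMMIT) none of them do, so a recovery
    # pass never sees a half-resolved failure record.
    conn.execute("BEGIN")
    conn.execute("UPDATE sent_payable SET status = 'Success' WHERE tx_hash = ?",
                 (tx_hash,))
    conn.execute("DELETE FROM failed_payable WHERE tx_hash = ?", (tx_hash,))
    conn.execute("UPDATE payable SET balance = balance - ?, last_paid = ? "
                 "WHERE wallet = ?", (amount, now, wallet))
    conn.execute("COMMIT")

resolve_late_success(conn, "0xold", "0xaaaa", 200, 1686830400)
print(conn.execute("SELECT status FROM sent_payable").fetchone()[0])      # → Success
print(conn.execute("SELECT COUNT(*) FROM failed_payable").fetchone()[0])  # → 0
print(conn.execute("SELECT balance FROM payable").fetchone()[0])          # → 300
```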