
Teach the `PayableScanner` on tx retries

Open bertllll opened this issue 10 months ago • 3 comments

Further down this thread there are more details, some of which may be rather confusing, both because of their sheer volume and because they may still reference things already taken care of in GH-602. That inspired me to reassess this task and write a more concise and accurate outline.

The main thing to be addressed under this card is the functionality that will belong in the start_scan() method of the PayableScanner, specifically its new retry mode. You will find it in scanners/mod.rs as the trait StartableScanner<ScanForRetryPayable, QualifiedPayablesMsg>, where the first type argument, referred to as the "trigger message", differentiates this mode from the other one triggered by the ScanForNewPayable msg.

What should it be able to do? It will read from the failed_payable table through its DAO and collect the records that came in recently, when the PendingPayableScanner recognized them as failures. These should still have the "retried" column unset, i.e. holding 0. Older failures will already have it set to 1, meaning they should not be reattempted.
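
Just to give an idea, a minimal sketch of that read could look like the following. The trait name, method name, and record fields are assumptions for illustration, not the existing API, and plain strings stand in for the project's wallet type:

// Hypothetical shapes; align them with the real failed_payable schema.
pub struct FailedPayableRecord {
    pub recipient_wallet: String, // creditor the failed tx was addressed to
    pub amount_minor: u128,       // amount of the failed attempt
    pub retried: bool,            // false (0 in the table) = not yet reattempted
}

pub trait FailedPayableDao {
    // Returns only the rows whose "retried" column still holds 0.
    fn failed_payables_awaiting_retry(&self) -> Vec<FailedPayableRecord>;
}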

Once we have that collection, we should gather the current balances for the accounts whose wallets match the intended receivers of the failed transactions. That should be done via the PayableDao, which will probably need a new method to serve this purpose.
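
The new PayableDao method could be shaped roughly like this; the name and signature are only a suggestion:

pub trait PayableDao {
    // ...existing methods

    // Hypothetical addition: return the current PayableAccounts whose wallets
    // match the intended receivers of the failed transactions.
    fn accounts_for_wallets(&self, wallets: &[String]) -> Vec<PayableAccount>;
}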

We should merge these two collections and make sure the result takes the shape required by the QualifiedPayablesMsg. Remember that we do want to use this message, because it is the same one we produce from the other mode of this scanner. Once the start_scan() method ends, we want to continue essentially identically, with the same data structures.
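
Using the sketched types from above, the merge itself could then be as plain as the helper below. It's only an illustration of the intent; the real code will live inside start_scan() and end up in whatever shape the QualifiedPayablesMsg expects, and the wallet comparison is abstracted behind a closure because I don't want to prescribe how the project's wallet type is rendered:

// Hypothetical helper: keep only the current accounts whose wallets appear
// among the failed records, so the amounts reflect the up-to-date balances.
fn merge_failed_with_current_balances(
    failed: &[FailedPayableRecord],
    current_accounts: Vec<PayableAccount>,
    wallet_as_string: impl Fn(&PayableAccount) -> String,
) -> Vec<PayableAccount> {
    current_accounts
        .into_iter()
        .filter(|account| {
            let wallet = wallet_as_string(account);
            failed.iter().any(|record| record.recipient_wallet == wallet)
        })
        .collect()
}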

We do, however, also have to visit the end_scan() method. Interestingly, we have an opportunity to consider getting rid of the mark-sent-payable procedure entirely; it existed only to prevent a repeated payment to the same address before the first one completes. First, we now know that this could never actually happen, simply by the design of blockchains. Second, the scan scheduling system does not allow it either: the new-payable scan can never start again while the pending payables (and therefore the failed payables) haven't been resolved yet.

Plus, we should probably introduce a flag into the PayableScanner which is set when we enter the retry-payable scan and signifies that we're dealing with a retry. It should be unset when we end the scan (and checked to be false, no matter which mode we're just finishing).

If this flag is true, though, and the scan just finished produced some supposedly successful payments, we should, in the finish_scan() method, update the failed_payable table so that the "retried" column holds a 1 for those rows.
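
As a rough sketch of that bookkeeping, with the struct, flag, and DAO method names invented only for illustration:

// Illustrative only; the real scanner, fields, and DAO methods are named differently.
struct RetryAwarePayableScanner<Dao: FailedPayableWriteDao> {
    retry_scan_in_progress: bool,
    failed_payable_dao: Dao,
}

// A hypothetical write-side counterpart of the read method sketched earlier.
trait FailedPayableWriteDao {
    fn mark_retried(&self, wallets: &[String]);
}

impl<Dao: FailedPayableWriteDao> RetryAwarePayableScanner<Dao> {
    fn finish_scan(&mut self, wallets_of_successful_payments: &[String]) {
        if self.retry_scan_in_progress && !wallets_of_successful_payments.is_empty() {
            // Flip the "retried" column to 1 for the rows we just reattempted.
            self.failed_payable_dao.mark_retried(wallets_of_successful_payments);
        }
        // The flag must read false again regardless of which mode just finished.
        self.retry_scan_in_progress = false;
    }
}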

Again, we may not need to keep the logic for marking sent payables in the payable table, which implies we may want to remove that column from the payable table as well. You may do that in this card too.

bertllll avatar Mar 26 '25 11:03 bertllll

The following concern has turned out to be rather groundless.

I decided to leave it here because it may still unveil some edge cases:

Interestingly, we'll need to deal with a novelty: it actually matters whether we mix those two collections up, and I think we should not. With the PaymentAdjuster in mind, we need to remember the cases when the Node runner's wallet is close to empty. The PaymentAdjuster should then take care of selecting only the payments that are in some way more important, and I'm now thinking that the very fact a transaction requires a retry contributes to that difference in importance. This aspect may need to be worked out as we go, as I'm not sure I can provide the best solution right away. If you don't have your own, you can grab this one:

Extend the struct that carries the qualified payables with another field that expresses whether it is a retry or not, and that also keeps the reason for the retry. That shouldn't be complicated; it only requires an optional enum in there, translated directly from the failure codes in the database (fxxx).

Example:

pub struct QualifiedPayableAccount {
    // ...various fields
    retry_opt: Option<RetryReason>,
}

My idea revolves around the assumption that the PaymentAdjuster should later get another CriterionCalculator which looks at this retry_opt field and evaluates the RetryReason. (CriterionCalculator is a trait with a method that takes a structure of nested structs, one of which is the QualifiedPayableAccount, so you can easily reach this field.)

This enum could be shaped as follows:

pub enum RetryReason {
    General(String),
    LostTx,
    PendingTooLong,
}
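
As an illustration of the calculator idea (the real CriterionCalculator trait has a different, nested input type; this only shows how the retry_opt field could translate into extra weight, with placeholder numbers):

// Hypothetical weighting; names and numbers are placeholders.
fn retry_criterion(retry_opt: &Option<RetryReason>) -> u128 {
    match retry_opt {
        // A tx we abandoned only because it lingered too long still competes
        // with a live copy in the mempool, so give it extra priority.
        Some(RetryReason::PendingTooLong) => 1_000,
        // Dead transactions gain nothing from this criterion.
        Some(RetryReason::General(_)) | Some(RetryReason::LostTx) | None => 0,
    }
}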

I'd suggest ignoring the first two, because they mean the transaction is dead and there is no chance it would ever complete. The third type is different, though: it means the transaction is still tolerated in the mempool, and it is only us who have chosen to condemn it because the waiting time doesn't suit us. (In some discussions it was said that the time limit should equal the payable scan interval. However, we need to be careful here, as we should probably add a constraint on the minimal value that can be configured for that interval - 30 seconds, for instance.) My idea leans on the fact that under some circumstances we could get a large set of desirable payments of which only some transactions can be selected because of fund constraints. My question is what happens if the newly constructed set contains fewer transactions than the number of previously executed payments. In that case almost all transactions would be replaced with updated transactions carrying identical nonces, but one would remain from before, untreated; and if the reason for the retry was a too long execution time, that transaction could stay in the pool, lingering, and theoretically eventually be picked up by a miner. Any possible adverse effect? Possibly losing a little money as some gas is consumed (I'm not even sure whether that's a real thing). Otherwise, I cannot think of anything harmful.

Plus, having thought about it thoroughly, as long as the PaymentAdjuster prioritizes the accounts with the smallest surpluses above the payment thresholds, I don't think the modeled situation could even take place in the real world.

bertllll avatar Mar 26 '25 16:03 bertllll

OLDER NOTES

Status quo

When the payable scanner calls its .begin_scan(), it reads from the payable table in our database, employing the PayableDao. Historically, it has looked for rows with an empty pending_payable_rowid column. If that column contained a number, it meant there was a transaction in progress for this creditor and we intended to let it complete before making another attempt. This gave us a list of payables, none of which yet indicated whether we should generate a payment for it. That is resolved in the next step, where we examine each account one by one and select those whose debt has matured.
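
In rough terms, that historical read boils down to something like the sketch below; the column and function names are only indicative, not the literal production code:

// Illustrative only; wallet_address is a guessed column name,
// pending_payable_rowid is the column discussed above.
fn non_pending_payable_wallets(conn: &rusqlite::Connection) -> rusqlite::Result<Vec<String>> {
    let mut stmt =
        conn.prepare("SELECT wallet_address FROM payable WHERE pending_payable_rowid IS NULL")?;
    let wallets = stmt
        .query_map([], |row| row.get::<_, String>(0))?
        .collect::<rusqlite::Result<Vec<String>>>()?;
    Ok(wallets)
}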

New goals

This has gone through various design stages; the latest rests on a layout where we actually have two modes of payable scans.

a) An isolated payable scan that concentrates only on completely new payments (not retries). That implies that in this "cycle" the PendingPayableScanner does not take part at all.

b) Pending-transaction verification cycles, which require a couple of scanners tied together. The PendingPayableScanner always goes first and determines whether, and what, the PayableScanner behind it will do. This cycle is naturally more complex and has a couple of unique features, mainly in the PayableScanner.

Because we assume we'll use the same PayableScanner in both cases, it needs to be able to run in two different modes.

Mode A is simple and agrees with the logic implemented so far; nothing really needs to be done.

Mode B brings in a good number of novelties. The scanner needs to reach into the new sent_payable table and extract the records marked with a failure (fxxx).

Next we'll run a query on the payable table to find out the present balances of the creditors in question. The set of full fingerprints of the previously sent payments will then be converted into the standard QualifiedPayableAccount format. When that's done, we will adjust the balance so that we always send payments whose amounts are up to date with the most recently accrued debts. From here on this should allow us to continue in the normal way: by sending the set to the BlockchainBridge (which reads the balances of the consuming wallet and calls up the BlockchainAgent), then to the Accountant, which consults the PaymentAdjuster, after whose job the gathered data goes to the BlockchainBridge again, by then complete and enabling a dispatch as soon as the raw transactions are put together, framed in a batch, and let go.

Some additional notes: The SentPayableDao will need a new read method, something like .transactions_requiring_retry().
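
A possible declaration, with the record type left as a placeholder because it depends on the final sent_payable schema:

// Hypothetical; only the method name comes from the note above.
pub struct SentPayableRecord; // placeholder for whatever a sent_payable row maps to

pub trait SentPayableDao {
    // ...existing methods

    // Returns the rows of sent_payable carrying a failure status (fxxx).
    fn transactions_requiring_retry(&self) -> Vec<SentPayableRecord>;
}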

In the BlockchainBridge, just before the actual sending of the batch, we perform an operation to preserve the so-called PendingPayableFingerprints. We construct one for each transaction and send them over to the Accountant. Be cautious, though, because from now on we will ship these into the sent_payable table. That also requires structural changes to the carrying struct: make PendingPayableFingerprint compatible with the pieces of information stored in the columns of that table. It will have to contain enough for us to construct a new row there.
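
As an illustration of what "compatible with the sent_payable columns" might mean, the carrying struct could end up with fields roughly like these; names and types are guesses, not the final schema, and plain strings stand in for the project's wallet and hash types:

// A rough guess at the shape; align the actual fields with the real sent_payable columns.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct PendingPayableFingerprint {
    pub recipient_wallet: String,         // creditor address, now present in both tables
    pub tx_hash: String,                  // transaction hash returned by the provider
    pub amount_minor: u128,               // paid amount
    pub timestamp: std::time::SystemTime, // when the batch was dispatched
    pub gas_price_minor: u128,            // gas price used for this attempt
    pub nonce: u64,                       // nonce used for this attempt
}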

Another part that needs to be reviewed is the .end_scan() method. That doesn't necessarily imply it has to undergo any modifications, apart from possibly very minor things; it rather looks like this paragraph will only help with reasoning about leaving the current state of this code roughly as it is. That is, if the act of sending the transaction succeeds per se, we will want to do as always and mark a new pending transaction. After some consideration, we may even preserve the pending_payable_rowid column, renamed to sent_payable_rowid. It wasn't an easy call because, unlike in the former state, we will now have the earning wallet address of the creditor in both tables. (We had previously decided to avoid that by using the rowid as the link between tables.) However, we now have more experience than when we implemented this before, and it makes us believe that storing transaction history including the creditor's address has the potential to improve future debugging and simply the navigation through this table. Thus, I'd vote to have the wallet address in both tables. On the other hand, I wouldn't insist on using it as the link between tables during a search; we should probably still do that by referencing the unique rowid, as we have done all along.

Last note: importantly, under this card it's your responsibility to make sure we haven't broken the integrity of the pieces developed separately and later put together. Do a visual check of the whole payment mechanism. Besides that, we need to end up with some kind of firm guarantee that the code works along the whole length of the payment processing, with all the forks in the code flow we can encounter given the variability and uncertainty of the responses returned by our blockchain service provider. Ideally, this should be covered by an integration test that, by its structure, sits somewhere between unit tests (testing only a distinct piece of logic, a small part of the whole) and our standard integration tests, which always exercise the code through the complete binary of our software. Something like what was attempted in the test pending_transaction_is_registered_and_monitored_until_it_gets_confirmed_or_canceled, even though we already know from historic interactions with that test and its maintenance that it is very difficult to keep up with new changes. The problematic maintenance is such a serious factor that it rather argues against creating more tests of this complexity. (The test was written with the idea that it would include every step in the process of checking on a few pending transactions: one completes successfully and one is a failure detected from the outside world.) Still, it would be nice to have the whole cycle tested, including a transaction that is found pending too long, another that is missing, and maybe one that completes right away in the first cycle while the other two complete in the following cycle, after a retry.

bertllll avatar May 27 '25 21:05 bertllll

Objective of the card:

  1. Find the failed payables
  2. Look into the payable DAO to update the amount
  3. Prepare QualifiedPayables (it may not be there)

utkarshg6 avatar Jun 24 '25 15:06 utkarshg6

REGARDING THE NONCES

When the PayableScanner starts in the retry-payable mode, it composes a collection of payables. An account qualifies if it comes from the failed_payable table and its failure_status column says RetryRequired. Based on that, we fetch the corresponding PayableAccounts from the payable table.

Each PayableAccount then becomes part of the UnpricedQualifiedPayables struct, which internally holds a collection of those accounts coupled with the gas price value from the previous attempt, read out of the failed_payable table. The struct holding that pair is called QualifiedPayablesBeforeGasPriceSelection.

Besides that, we also need to bundle in the information about the previously used nonce, so that every retried tx can be traced back to the original tx that failed on the first attempt.

It's advisable that the structure now becomes something like this:


// This struct is just modified a bit
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct QualifiedPayablesBeforeGasPriceSelection {
    pub payable: PayableAccount,
    pub previous_failed_attempt_opt: Option<PreviousPayablePaymentAttempt>,
}

// This is a completely new struct
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct PreviousPayablePaymentAttempt {
    pub gas_price_minor: u128,
    pub nonce: u64,
}

However, the structure PricedQualifiedPayables, which comes out of processing UnpricedQualifiedPayables, is currently indifferent to the mode the PayableScanner runs in. That must change, as we need to store the nonces there as well. I suggest creating an enum to replace the struct QualifiedPayableWithGasPrice. (Please, once you do that, just use the search & replace tool and apply the correct names to the variables pointing to it, in both the prod code and the test tree.)

Something like


pub enum TxContextByPayableMode {
    NewPayable { gas_price_minor: u128 },
    RetryPayable { gas_price_minor: u128, previously_used_nonce: u64 },
}

which I'd prefer to be placed inside QualifiedPayableWithGasPrice, renamed to QualifiedPayableWithTxContext, like:


// Renamed QualifiedPayableWithGasPrice... but still taking the same place in the code. It's a substitution. 
#[derive(Debug, PartialEq, Eq, Clone)]
struct QualifiedPayableWithTxContext {
    pub payable: PayableAccount,
    pub tx_context: TxContextByPayableMode,
}

When you get to the place where we construct the txs before we sign them, you will find the spot where we make an RPC call to query the transaction count (i.e. the next nonce). The change now is that we want to make this call only when running in the new-payable mode; otherwise the nonces supplied together with the payable accounts should be used instead.
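
In code, that decision point could boil down to a match like the one below, assuming the enum proposed above; the closure stands in for whatever wrapper makes the eth_getTransactionCount call:

// Illustrative: query the provider only for brand-new payables,
// otherwise reuse the nonce recorded from the failed attempt.
fn resolve_nonce(
    tx_context: &TxContextByPayableMode,
    fetch_transaction_count: impl Fn() -> u64,
) -> u64 {
    match tx_context {
        TxContextByPayableMode::NewPayable { .. } => fetch_transaction_count(),
        TxContextByPayableMode::RetryPayable {
            previously_used_nonce, ..
        } => *previously_used_nonce,
    }
}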

bertllll avatar Jul 18 '25 13:07 bertllll