Optionally skip work when performing catchup
This has the potential to make "catchup" much faster in general.
When we apply a ledger during "catchup", we have a trusted data set that we can use to avoid certain expensive operations; in particular, we have access to the transaction result.
The transaction result allows us to derive the following:
- if a transaction was successful, it means that the signatures present were valid
  - this means that we can skip the actual signature verification at apply time (the crypto part, we still need to accumulate weights to ensure that we preserve the logic that discards one time signers)
- if a transaction failed
  - we need to perform signature verification
  - we otherwise know that we can skip the actual application of operations (so we save on I/O and CPU there)
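
The two cases above can be sketched as a small decision function. This is only an illustration in Python, not stellar-core code: `ReplayPlan` and `plan_for` are hypothetical names, and the assumption that signer weights are always accumulated (so one-time signers are still discarded) is taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayPlan:
    """Work to perform for one transaction during catchup (hypothetical type)."""
    verify_signature_crypto: bool    # run the expensive cryptographic check
    accumulate_signer_weights: bool  # kept so one-time signers are still discarded
    apply_operations: bool           # execute the operations against ledger state


def plan_for(tx_succeeded: bool) -> ReplayPlan:
    """Derive the replay plan from the trusted transaction result."""
    if tx_succeeded:
        # Success implies the signatures were valid: skip the crypto,
        # keep weight accumulation, and apply the operations.
        return ReplayPlan(
            verify_signature_crypto=False,
            accumulate_signer_weights=True,
            apply_operations=True,
        )
    # Failure: the signatures still have to be verified,
    # but the operations themselves do not need to be applied.
    return ReplayPlan(
        verify_signature_crypto=True,
        accumulate_signer_weights=True,
        apply_operations=False,
    )
```

Keeping weight accumulation on in both branches means the only difference between the fast path and the normal path is the crypto call and the operation apply step.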
We could add a couple of config options to toggle those behaviors (opt-in at first):
- when we validate builds using historical replay as test corpus
  - I think that disabling signature verification but keeping the application of failed transactions is probably a good middle ground.
  - I don't know if it's a full "turn it off" or something partial, like performing signature verification based on some condition such as H(source account) or the number/type of signatures. I suspect that if the code is written so that the only difference is "crypto", then it's safe to turn it off (and have the actual code path unit tested).
- when rebuilding history archives, both can be turned off.
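
As a rough illustration of what two independent opt-in toggles could look like, here is a Python sketch. The option names (`skip_signature_verification`, `skip_failed_tx_application`) and the preset names are hypothetical and are not existing stellar-core config options; both default to off to match the "opt-in at first" suggestion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatchupSkipConfig:
    """Hypothetical opt-in toggles; both default to full validation."""
    skip_signature_verification: bool = False
    skip_failed_tx_application: bool = False


# Default: replay everything, verify everything.
FULL_VALIDATION = CatchupSkipConfig()

# Validating builds against historical replay: skip the crypto only,
# still apply failed transactions (the "middle ground" above).
BUILD_VALIDATION = CatchupSkipConfig(skip_signature_verification=True)

# Rebuilding history archives: both kinds of work are skipped.
ARCHIVE_REBUILD = CatchupSkipConfig(
    skip_signature_verification=True,
    skip_failed_tx_application=True,
)


def skipped_work(cfg: CatchupSkipConfig) -> list[str]:
    """List which kinds of work a given configuration skips."""
    skipped = []
    if cfg.skip_signature_verification:
        skipped.append("signature verification")
    if cfg.skip_failed_tx_application:
        skipped.append("failed transaction application")
    return skipped
```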
With #2813 done, I think it will be safe to have this enabled by default on all production setups.
I've discussed this idea

> if a transaction failed
> - we need to perform signature verification
> - we otherwise know that we can skip the actual application of operations (so we save on I/O and CPU there)

and think that it would save a lot of replay time. But I'm not sure if we should make it the default. Some operators might be expecting to actually confirm all of history, and changing the default would instantly invalidate those assumptions. We should leave this as an opt-in performance improvement, but advertise it aggressively (for example, if you run catchup, then stellar-core should log about the existence of this option if it isn't set).
@MonsieurNicolas why is it required to perform signature verification for failed transactions? something related to one-time signers?
ah, looks like this is because we still remove one-time signers: https://github.com/stellar/stellar-protocol/issues/495 (cc @ThomasBrady)
I was chatting with @anupsdf yesterday and didn't realize we were considering this issue only partially (only skipping signature verification). For that part alone, the impact may not be large enough (I realized that there are edge cases where skipping is not possible), while it makes the code harder to maintain.
So how much savings do we see when disabling signature verification (unconditionally at first, as part of a quick eval) on a longer catchup (for example, pick a few ranges that take something meaningful, like an hour, to replay excluding bucket apply)? @ThomasBrady
Hey, the current plan is to skip signature verification and validity checks for all transaction types and to skip application for failed transactions, outlined here. I have a branch with this implemented, but I haven't run any benchmarks yet to see the difference vs normal catchup. I will share those numbers when I do.
I see, @ThomasBrady. Yeah, we definitely need some idea of the potential savings for each part of this (like: is skipping crypto high impact?), as it potentially creates a bunch of future maintenance burden. We also need to know the impact of the extra overhead of downloading results (that is entirely new) in the context of supercluster (results are also fairly large, which may impact disk requirements).
I have some results from locally running catchup on 1000 ledgers:

| Configuration | user (s) | system (s) | total (s) | Speedup over baseline (user / system / total) |
| --- | --- | --- | --- | --- |
| *Baseline (no skipping)* | 429 | 115 | 138 | n/a |
| *Skip Failed* | 373 | 99 | 114 | 1.14x / 1.16x / 1.21x |
| *Skip Failed + verification* | 334 | 88 | 95 | 1.28x / 1.30x / 1.45x |

So skipping failed transactions, with the added overhead of downloading results, resulted in a 1.21x speedup, and skipping both failed txns and signature verification resulted in a 1.45x speedup. Obviously running in supercluster will give us more insight, but it seems like the savings are worth it based on these results.
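
For anyone who wants to reproduce the ratios, here is a small Python sketch that recomputes the speedups from the timings reported above (the helper name `speedups` is just for illustration). Recomputing from the rounded timings gives values within 0.01 of the quoted ones (for example 1.15x vs 1.14x on user time), which is presumably because the quoted ratios came from unrounded measurements.

```python
# Reported (user, system, total) times in seconds for 1000 ledgers.
baseline = (429, 115, 138)
runs = {
    "Skip Failed": (373, 99, 114),
    "Skip Failed + verification": (334, 88, 95),
}


def speedups(base, run):
    """Return baseline/run ratios for user, system and total time."""
    return tuple(round(b / r, 2) for b, r in zip(base, run))


for name, times in runs.items():
    user, system, total = speedups(baseline, times)
    print(f"{name}: {user}x / {system}x / {total}x")
```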