
Further improvement of the performant restore

xumengpanda opened this issue 5 years ago · 8 comments

The first version of the performant restore will be released in FDB 6.3 as an experimental feature. It thus closes issue https://github.com/apple/foundationdb/issues/1049.

The performant restore can still be improved further in various areas. This issue tracks those follow-up improvements to https://github.com/apple/foundationdb/issues/1049.

Note: Fault tolerance will be handled in a separate issue as a follow-up project.

Performance improvement

  • [ ] Performance bottleneck 3: Stragglers can occur in both the applier and loader roles. When one occurs, the current implementation has to wait for the straggler. Solution. We can stop the straggler and redistribute its workload to all nodes with the same role. Challenge. When the straggler is an applier, it may have already applied some atomic mutations. Reapplying atomic mutations would violate correctness. We need to bookmark the straggler's progress and only execute the remaining mutations.
  • [ ] Performance improvement 6: If a loader knows the versioned key range of every range file, it can directly filter out log mutations that fall in a range file's key range at a smaller version. This reduces the amount of data sent from loaders to appliers (see the sketch after this list);
  • [ ] Performance improvement 7: The start version can be the snapshot version (i.e., the range file version) instead of 0. We can also use the range file version to skip log mutations that are below the range version and inside the range file's key range;
  • [ ] Performance improvement 8: Balance the workload on loaders for parsing range and log files across version batches;
  • [x] Performance improvement 9: Filter out mutations whose version is smaller than the smallest range file version.
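
Improvements 6, 7, and 9 share one idea: a log mutation is redundant if a range (snapshot) file already covers its key at a newer-or-equal version. A minimal sketch of that filter, using simplified stand-in types (KeyRange, Mutation, and the range-file index are illustrative, not the actual Flow types in the restore code):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Version = int64_t;

struct KeyRange {
    std::string begin, end; // [begin, end)
    bool contains(const std::string& k) const { return k >= begin && k < end; }
};

struct Mutation {
    Version version;
    std::string key;
    std::string value;
};

// Versioned key ranges of all range (snapshot) files, known to the loader.
using RangeFileIndex = std::vector<std::pair<KeyRange, Version>>;

// A log mutation can be dropped on the loader if some range file's snapshot
// already covers its key at a newer-or-equal version: the snapshot data
// supersedes it, so it never needs to be sent to an applier.
bool isSupersededBySnapshot(const Mutation& m, const RangeFileIndex& rangeFiles) {
    for (const auto& [range, snapshotVersion] : rangeFiles) {
        if (range.contains(m.key) && m.version < snapshotVersion) {
            return true;
        }
    }
    return false;
}
```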

Memory saving improvement:

  • [ ] The master does not need to keep the sampled keys' values. Loaders should send only the keys and metrics of parsed mutations (see the sketch below).
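
A minimal sketch of the message shape this implies (the struct and field names are illustrative, not the actual sampling protocol):

```cpp
#include <cstdint>
#include <string>

// What a loader sends to the master per sampled mutation: only the key and a
// weight. No value is carried, so the master's sampling memory stays small.
struct MutationSample {
    std::string key; // key of the parsed mutation
    uint64_t weight; // metric for balancing, e.g. key + value size in bytes
};
```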

Robustness improvement

  • [ ] Sanity check that range files cover the entire key space (see the sketch after this list);
  • [ ] Sanity check that the collected backup files are restorable before executing the restore;
  • [ ] Abort an ongoing restore and restart a new restore in the simulation test;
  • [ ] A log file can be 1.2 GB or larger as an FDB cluster grows; we should expect to handle the case when a loader cannot hold all the data in a log file;
  • [ ] A version batch can be hundreds or thousands of GB in extreme cases; we should handle the case when the restore cluster cannot hold all of a version batch's data in memory;
  • [ ] Add a sanity check on the actor execution sequence: for example, handleSendMutationVectorRequest should not have any effect once handleApplyToDBRequest starts;
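
A minimal sketch of the first check above, that range files cover the whole key space, using simplified stand-in types (the real check would run over the range-file metadata before the restore starts):

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

struct KeyRange {
    std::string begin, end; // [begin, end)
};

// Verify the union of range-file key ranges covers [allKeysBegin, allKeysEnd).
void checkRangeFileCoverage(std::vector<KeyRange> ranges,
                            const std::string& allKeysBegin,
                            const std::string& allKeysEnd) {
    std::sort(ranges.begin(), ranges.end(),
              [](const KeyRange& a, const KeyRange& b) { return a.begin < b.begin; });
    std::string covered = allKeysBegin;
    for (const auto& r : ranges) {
        if (r.begin > covered) {
            throw std::runtime_error("gap in range file coverage at key: " + covered);
        }
        covered = std::max(covered, r.end);
    }
    if (covered < allKeysEnd) {
        throw std::runtime_error("range files stop before: " + allKeysEnd);
    }
}
```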

Visibility improvement:

  • [ ] Which process/restore role died (e.g., due to OOM);
  • [ ] How much time each role spends in each phase of a version batch; this helps identify the bottleneck;
  • [ ] Characteristics of the processed backup data, such as the number of sets and of the different types of atomicOps (e.g., setVersionStampedKey, atomic add);
  • [ ] Add status reporting, as the existing fdbrestore status does, to report errors and progress while the new restore is running;

Code improvement

  • [x] Refactor typeString[m.type] into a function that checks whether the type is valid (see the sketch after this list);
  • [ ] Remove the unit test file for "/FastRestore/RestoreLoader/splitMutation" after the new restore rolls out;
  • [ ] getVersionSize() speed can be improved if all log files whose endVersion is smaller than perVersion are pre-filtered out;
  • [ ] If the restore master dies in the middle of the restore (e.g., because the configured version batch threshold is too small), it should notify the other restore processes. Also consider the following situations: i) what if a dead restore process is automatically restarted? ii) does this have any effect on the running restore operation (i.e., does it cancel it, etc.)? iii) have a way to push this kind of information back through the restore status mechanism.
  • [ ] Revisit splitMutation() function. Try to re-use existing functions.
  • [x] Merge the mutationMap structures used on loaders to concatenate log mutations. The maps are std::map<Standalone<StringRef>, Standalone<StringRef>>* pMutationMap and std::map<Standalone<StringRef>, uint32_t>* pMutationPartMap;
  • [ ] Combine applierMutationsBuffer and applierVersionsBuffer;
  • [ ] Clean up unused fields and code;
  • [ ] Filter out backup files that contain duplicate mutations: when a backup agent fails (due to timeout) in the middle of recording mutations for a range, the already-read mutations have already been written to backup files. When the next backup agent retries the transaction, it creates a backup file that contains some or all of the mutations in the half-done backup file;
  • [ ] Revisit tr.reset() and remove unnecessary reset();
  • [ ] Consider supporting multiple restore roles per restore worker;
  • [ ] Handle the protocol version in the backup files; an older or newer binary can generate incompatible files.
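
For the first item in this list (already done), the shape of the refactor looks roughly like the sketch below; the enum and string table are illustrative stand-ins for the real MutationRef type codes and typeString table:

```cpp
#include <cstdint>

enum MutationType : uint8_t { SetValue = 0, ClearRange, AddValue, MAX_MUTATION_TYPE };

static const char* const typeString[] = { "SetValue", "ClearRange", "AddValue" };

inline bool isValidMutationType(uint8_t type) {
    return type < MAX_MUTATION_TYPE;
}

// Replaces unchecked typeString[m.type] at every call site with one
// bounds-checked helper.
inline const char* getTypeString(uint8_t type) {
    return isValidMutationType(type) ? typeString[type] : "Invalid";
}
```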

Post PR #2908

  • [ ] Consider allowing empty files in version batches, and validate the invariant: the last version batch's endVersion should be no smaller than targetVersion if the backup is restorable (see the sketch after this list). Note that when computing version batches, raw backup files, including empty files, should be present for version continuity analysis. Add a comment for this;

  • [ ] Change the SevFRMutationInfo TraceEvents to debugFRMutation() or something similar that allows a DBA to configure online which parsed mutations to dump into analysis tools, such as Splunk, for analysis. One use case: a user may use the new restore system to dump the backup data into Splunk and use Splunk to query the mutations. <-- wait for PR https://github.com/apple/foundationdb/pull/2873 and add something similar.
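
A minimal sketch of the version-batch invariant named in the first item above (types and names are simplified stand-ins):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using Version = int64_t;

struct VersionBatch {
    Version beginVersion;
    Version endVersion; // exclusive
};

// If the backup is restorable, the last version batch must reach the
// requested target version; empty files still participate in the version
// continuity analysis that produces these batches.
void checkVersionBatchInvariant(const std::map<int, VersionBatch>& versionBatches,
                                Version targetVersion) {
    if (versionBatches.empty()) return;
    const VersionBatch& last = versionBatches.rbegin()->second;
    assert(last.endVersion >= targetVersion);
}
```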

xumengpanda · Apr 13 '20 19:04

Suggestion from Steve:

  • For backup/restore tests, perhaps the knob that controls the shard count should be turned way up. This can help make the tests more intensive.

xumengpanda · Apr 15 '20 23:04

Visibility improvement:

Each phase and sub-phase on each node should have the fields batchIndex and nodeID (see the sketch below).
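
A self-contained sketch of what that implies (this is not the Flow TraceEvent API; event and field names are illustrative):

```cpp
#include <iostream>
#include <string>

// One record per phase/sub-phase execution, keyed by (batchIndex, nodeID) so
// per-node, per-version-batch timing can be reconstructed from the logs.
struct PhaseTraceEvent {
    std::string phase;  // e.g. "LoadLogFile", "SendMutations", "ApplyToDB"
    int batchIndex;     // version batch the phase belongs to
    std::string nodeID; // restore worker emitting the event
    double seconds;     // time spent in the phase
};

void emitPhaseTrace(const PhaseTraceEvent& e) {
    // The real code would emit a TraceEvent carrying these fields as details;
    // here we just print one log line.
    std::cout << "FastRestorePhase"
              << " Phase=" << e.phase
              << " BatchIndex=" << e.batchIndex
              << " NodeID=" << e.nodeID
              << " Seconds=" << e.seconds << "\n";
}
```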

xumengpanda · May 01 '20 23:05

Restore tooling:

  • [ ] Users should be able to specify multiple key ranges, at the same restore version and at different restore versions, in one shot. RestoreMaster should plan and consolidate these requests to avoid going through the backup data multiple times (see the sketch below).
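
A minimal sketch of the consolidation step (types and names are illustrative): group the requested key ranges by restore version so each version's backup data is scanned once.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Version = int64_t;

struct KeyRange {
    std::string begin, end; // [begin, end)
};

struct RestoreRequest {
    KeyRange range;
    Version targetVersion;
};

// Consolidate requests so each distinct target version triggers a single
// pass over the backup data, covering all key ranges requested at it.
std::map<Version, std::vector<KeyRange>>
planRestore(const std::vector<RestoreRequest>& requests) {
    std::map<Version, std::vector<KeyRange>> plan;
    for (const auto& r : requests) {
        plan[r.targetVersion].push_back(r.range);
    }
    return plan;
}
```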

xumengpanda · May 04 '20 03:05

  • [ ] Disable the periodic metric tracing of a VB when the VB finishes.

xumengpanda · May 04 '20 04:05

  • [x] Free restore master sampling memory once it is no longer needed, which allows the master to handle more inflight version batches.

xumengpanda · May 04 '20 05:05

  • [ ] Protect RestoreMaster from running out of memory. Circus runs show that the restore master can OOM.

xumengpanda · May 04 '20 18:05

Hello! I'd like to do some performance testing of FastRestore and compare it with other ways of restoring a database in my environment. Is something in upstream ready for performance testing?

oleg68 · Jul 09 '20 10:07

> Hello! I'd like to do some performance testing of FastRestore and compare it with other ways of restoring a database in my environment. Is something in upstream ready for performance testing?

Hi, it is not ready yet. I tested it in a small cluster and am waiting to test it in a bigger cluster and with different types of backups.

xumengpanda · Jul 09 '20 16:07