
Check does not check the entire repository

SiddharthManthan opened this issue 1 year ago · 7 comments

This is a continuation of #75.

Problem

  • Currently, check randomly selects a part of the repository.
  • This does not guarantee that all the data will be checked after sufficient runs. (Source)

Solution

  • Implement --read-data-subset=n/t (Documentation).
  • This will check all the data after sufficiently many runs (see the sketch below).
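
As a rough sketch, assuming a repository at a placeholder path and a split into five groups, cycling through the subsets manually would look something like this:

```sh
# Each run reads one fifth of the pack data; cycling n from 1 to 5
# means every pack gets read once per five runs.
restic -r /srv/backup/repo check --read-data-subset=1/5   # run 1
restic -r /srv/backup/repo check --read-data-subset=2/5   # run 2
# ... and so on, up to ...
restic -r /srv/backup/repo check --read-data-subset=5/5   # run 5
```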

I have filed this as a bug report because the UI currently does not mention that not all data is checked. There should either be a warning, or proper check support should be implemented.

SiddharthManthan · Jul 22 '24 10:07

Does this repo setting not fit your use case? Setting this to 100% will re-download the whole thing ...

[screenshot]

Thinkscape · Jul 22 '24 23:07

Does this repo setting not fit your use case? Setting this to 100% will re-download the whole thing ...

[screenshot]

The repo is quite large; downloading all the data would take a long time. It might not even be possible, because the machine is not running 24/7.

SiddharthManthan · Jul 23 '24 10:07

Hey, --read-data-subset=n/t is only guaranteed to check the whole repo if you systematically increment n, which isn't something I expect to implement: that value would need to be tracked somewhere, which adds complexity relative to the value of the feature. If you're using storage that ensures its integrity (e.g. S3), I think there's also very limited value in actually verifying data integrity by downloading it; this is supported primarily for users of local storage or sftp.

The alternative would be checking random chunks of the repo, but this likely isn't better than using a random percentage.

garethgeorge · Jul 23 '24 19:07
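
A minimal sketch of the state tracking this would require, assuming a small cron wrapper outside of backrest with made-up paths and group count:

```sh
#!/bin/sh
# Hypothetical wrapper: persist the last-used subset index and advance it
# on every run, so all TOTAL groups get cycled through over TOTAL runs.
STATE=/var/lib/backrest/check-subset   # placeholder state file
TOTAL=5                                # placeholder number of groups
n=$(cat "$STATE" 2>/dev/null || echo 0)
n=$(( n % TOTAL + 1 ))
restic -r /srv/backup/repo check --read-data-subset="$n/$TOTAL"
echo "$n" > "$STATE"
```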

Hey, --read-data-subset=n/t is only guaranteed to check the whole repo if you systematically increment n, which isn't something I expect to implement: that value would need to be tracked somewhere, which adds complexity relative to the value of the feature. If you're using storage that ensures its integrity (e.g. S3), I think there's also very limited value in actually verifying data integrity by downloading it; this is supported primarily for users of local storage or sftp.

The alternative would be checking random chunks of the repo, but this likely isn't better than using a random percentage.

Check not only checks the integrity of repository files, but also checks for repository corruption. There have been many cases where the repository got corrupted even though the blobs themselves were error free (due to a bug in the restic code or other reasons). So check is very useful for cloud storage, even if that storage ensures integrity. If corruption is detected before data needs to be restored, it can be fixed in a timely manner. The feature does have value, even if it increases the complexity.

SiddharthManthan · Jul 24 '24 06:07

Check not only checks the integrity of repository files, but also checks for repository corruption.

That is true even in the current version of backrest. If you set it to 0.001% (0% might work as well), it already checks the indexes and stats the packs.

[screenshots]

Thinkscape · Jul 24 '24 23:07
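
For reference, the distinction being discussed maps onto restic's CLI roughly as follows (repository path is a placeholder):

```sh
# Structural check: verifies snapshots, the index, and pack metadata,
# but does not download and hash the pack contents.
restic -r /srv/backup/repo check

# Full verification: additionally downloads every pack and verifies the
# blobs against their stored hashes.
restic -r /srv/backup/repo check --read-data
```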

Check not only checks the integrity of repository files, but also checks for repository corruption.

That is true even in the current version of backrest. If you set it to 0.001% (0% might work as well), it already checks the indexes and stats the packs.

[screenshots]

But at 0.001% it will miss the data blobs. Errors in the data itself can only be detected by reading it.

Here is an example of data corruption: all files in the repo (index, data blobs, etc.) appear fine and checksum correctly, but the actual data (after restore) has a different checksum. There have been such bugs in the past (see the restic forums).

Such corruption can only be detected by a full data read.

SiddharthManthan · Jul 25 '24 08:07

It seems that using a percentage will eventually check every pack with high probability. For example, at 25%, a given pack will most likely have been checked after at most six runs.
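
As a rough back-of-the-envelope check of that claim, assuming each 25% sample is drawn independently of the previous ones, a particular pack is missed by a single run with probability 0.75, so the chance it is still unread after six runs is 0.75^6 ≈ 0.18, i.e. roughly an 82% chance that any given pack has been read at least once.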

If you are interested in this feature, I would suggest it is easier to work out those probabilities than to implement this request.

Additionally, if using subset=n/t, the complete operation (checking all t sets) must finish within a short time frame: the longer the gap between checks of a given set, the more newly added packs will go unchecked, because the set keeps growing.

I think the subset=n/t feature is really designed for the case where you have multiple machines and clustered storage: you could then run the check concurrently on t machines. It is not meant for a single machine to check the whole repo.
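
Sketching that idea with a made-up three-way split, each machine would always take the same fixed slice:

```sh
# Hypothetical three-machine split: each host checks its own third of
# the packs, so together they cover the whole repository concurrently.
restic -r /clustered/repo check --read-data-subset=1/3   # on host-a
restic -r /clustered/repo check --read-data-subset=2/3   # on host-b
restic -r /clustered/repo check --read-data-subset=3/3   # on host-c
```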

For a single machine, using a percentage (and understanding that a 25% check will not be complete after 4 checks) is superior, as you'll be checking random packs and have a higher chance of finding the needle in the haystack.

sedlund · Sep 29 '25 09:09