Cache in RAM for presence of blocks
When writing a backup, Conserve checks whether hash-addressed blocks are already present in the archive, before bothering to write them.
This turns into just one stat per block file, which is normally pretty cheap, but on slow devices, on network filesystems, and even more so on remote storage, it can account for a significant fraction of the backup time. (The cost could be mitigated by dispatching more write work in parallel, but the repeated checks would still be unnecessary work.)
It could be better to instead keep a set of known-present blocks in memory for the duration of the backup, so that the second lookup for any given block is answered without touching storage. The set should, of course, be updated when new blocks are written.
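A minimal sketch of that idea, assuming a hypothetical `PresenceCache` layered over the block store; `BlockHash` and the stat-based fallback are stand-ins here, not Conserve's real types or API:

```rust
use std::collections::HashSet;

// Stand-in for Conserve's real block hash type.
type BlockHash = String;

/// In-memory set of blocks known to be present in the archive.
struct PresenceCache {
    known_present: HashSet<BlockHash>,
}

impl PresenceCache {
    fn new() -> Self {
        PresenceCache {
            known_present: HashSet::new(),
        }
    }

    /// True if the block is known present, without touching storage.
    /// A miss means "unknown": the caller falls back to a stat.
    fn contains(&self, hash: &BlockHash) -> bool {
        self.known_present.contains(hash)
    }

    /// Record a block as present, either because a stat found it
    /// or because we just wrote it.
    fn insert(&mut self, hash: BlockHash) {
        self.known_present.insert(hash);
    }
}

fn main() {
    let mut cache = PresenceCache::new();
    let h: BlockHash = "abc123".to_string();
    assert!(!cache.contains(&h)); // first sight: would fall back to a stat
    cache.insert(h.clone()); // stat said present, or we wrote the block
    assert!(cache.contains(&h)); // second lookup answered from memory
}
```

The key property is that the cache is only ever an optimization: a hit is always trustworthy, and a miss just costs one stat, as today.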
I think a naive Bloom filter won't work here: Bloom filters return false positives, but here a hit must mean the block is definitely present and need not be written, otherwise we would silently skip writing a block that is actually missing and corrupt the backup. But perhaps there's some other smarter option than remembering all the present blocks exactly.
In general I'd like to avoid memory usage for backup being proportional to the existing archive size, so perhaps this should be capped in size or use an adaptive replacement cache.
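One way to cap memory is a bounded presence set with an eviction policy. The sketch below uses FIFO eviction purely to stay short; a real implementation might use LRU or adaptive replacement, as suggested above. Eviction is safe because, unlike a Bloom filter's false positives, forgetting a block only turns a fast answer back into a stat, never into a wrong answer. All names here are hypothetical:

```rust
use std::collections::{HashSet, VecDeque};

/// Size-capped presence cache with FIFO eviction.
struct CappedPresenceCache {
    cap: usize,
    set: HashSet<String>,
    order: VecDeque<String>, // insertion order, for eviction
}

impl CappedPresenceCache {
    fn new(cap: usize) -> Self {
        CappedPresenceCache {
            cap,
            set: HashSet::new(),
            order: VecDeque::new(),
        }
    }

    fn contains(&self, hash: &str) -> bool {
        self.set.contains(hash)
    }

    fn insert(&mut self, hash: String) {
        if self.set.insert(hash.clone()) {
            self.order.push_back(hash);
            // Evict the oldest entry once over capacity, so memory
            // stays bounded regardless of archive size.
            if self.order.len() > self.cap {
                if let Some(oldest) = self.order.pop_front() {
                    self.set.remove(&oldest);
                }
            }
        }
    }
}

fn main() {
    let mut cache = CappedPresenceCache::new(2);
    cache.insert("a".into());
    cache.insert("b".into());
    cache.insert("c".into()); // evicts "a"
    assert!(!cache.contains("a")); // forgotten: would need a stat again
    assert!(cache.contains("b") && cache.contains("c"));
}
```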
As a complication, it might be useful to proactively, and in parallel with the backup, read in a list of every present block.
The cache can also learn from basis indexes: we know any block referenced there must already be present.
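Seeding from a basis index could look like the sketch below; `basis_block_addrs` stands in for iterating the block references in the previous backup's index, and is not Conserve's real API:

```rust
use std::collections::HashSet;

/// Seed the presence set from the blocks referenced by a basis index.
/// Any block referenced there must already exist in the archive, so it
/// can be trusted as present without a stat.
fn seed_from_basis<I: IntoIterator<Item = String>>(
    known_present: &mut HashSet<String>,
    basis_block_addrs: I,
) {
    for addr in basis_block_addrs {
        known_present.insert(addr);
    }
}

fn main() {
    let mut known_present = HashSet::new();
    seed_from_basis(
        &mut known_present,
        vec!["b1".to_string(), "b2".to_string()],
    );
    // Blocks from the basis index now need no stat at all.
    assert!(known_present.contains("b1"));
    assert!(known_present.contains("b2"));
}
```

Since unchanged files tend to reuse blocks from the basis backup, this seeding alone could eliminate most stats in the common incremental-backup case.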