borg recreate optimisations
The case: a big repo (1 TB, for example) with many snapshots, containing a lot of big archive files; lz4 compression was used.
The problem: borg recreate with --exclude and --recompress to zstd is extremely slow and seems to process every single snapshot separately.
Why not repack with the excludes to drop the unwanted big archive files (I'd expect roughly prune-like speed) and then recompress chunks as chunks, without treating them like files?
Currently it takes weeks on an HDD RAID mirror. Is it possible to optimize this?
Please be more specific: which (full) command exactly takes too long?
borg recreate --compression auto,zstd,3 -svp --show-rc \
--exclude '*/backup/*.tar.*' \
--exclude '*/*cache/*' \
--exclude '*includes/sessions/sess*' \
--exclude '*livestreet_cache*' \
--exclude '*cache/dbcache*' \
--exclude-caches --exclude-if-present '.nobackup' \
--recompress if-different \
${repo}
OK, so you recompress "if different" AND you exclude stuff.
So how do you think this can get faster?
I'm not sure how it is done internally, but why should it process all snapshots as files instead of recompressing chunks as chunks?
So it could be just two passes:
- repack with filter to remove some files
- recompress chunks
If you exclude stuff, it has to process all archives and look at all files in there anyway.
If you only recompress "if different", it will recompress a specific chunk only once. The next time it looks at it, it won't be "different" any more.
The next time it looks at it
For big repos it takes too much time, and it looks like it somehow processes everything again for every snapshot.
I thought it has some chunk index, so it should be easy to find the chunks with one compression and convert them to another. Instead, as I see with iostat, it rereads nearly the whole repo from disk for each snapshot.
Note: the chunks index only has id -> refcount, size, csize.
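For illustration, a chunks-index entry has roughly this shape (an illustrative sketch, not a verbatim copy of borg's hashindex code):

from collections import namedtuple

# Rough shape of one chunks-index entry in borg 1.x (illustrative only).
ChunkIndexEntry = namedtuple('ChunkIndexEntry', 'refcount size csize')

# The index maps: chunk_id (32-byte key) -> ChunkIndexEntry.
# There is no field recording which compressor produced the csize bytes,
# so the only way to learn the compression type is to read the chunk itself.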
Question: does it read the whole chunk to determine the compression type?
Answer: yes, see ArchiveRecreater.chunk_processor (pseudocode):
if recompress and not always_recompress:
    old_chunk = decrypt(id, repo.get(id), decompress=False)  # read, authenticate, decrypt
    compressor, level = Compressor.detect(old_chunk)
    if (compressor, level) == as_wanted:
        # do not overwrite the old chunk
And this is not possible in any other way because the AEAD decryption always needs to process the whole chunk and the compression type is part of the encrypted payload, together with the compressed data.
OK, so, as far as I understand now, a single pass of reading chunks from disk should be enough if I only recompress, correct?
But with
borg recreate --compression auto,zstd,3 -svp --show-rc --recompress if-different $REPO
it processes each snapshot separately.
And, again, if I only filter, without recompression, it processes each snapshot separately too:
borg recreate -svp --show-rc \
--exclude '*backup/*' \
$REPO
So, the main question: is it possible to optimize both processes so that each requires no more than one read of each chunk from disk, per repo, not per snapshot as it seems to be now?
If you do actions which influence the contents of an archive (by recreating it with newly excluded files), it of course needs to read each archive; how else could it create a new one from it?
About global recompression you are right: it COULD just recompress all chunks (getting the set of chunk ids from the chunks index). But IIRC that is not what it currently does; it just goes over all archives and discovers chunk ids from the items' chunk lists (as it would if you only recreated a single archive).
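A minimal sketch of that repo-wide idea, assuming simplified stand-in helpers (decrypt, detect_compression, decompress, compress and encrypt are placeholders, not borg's actual API):

def recompress_repo(repo, key, chunk_ids, wanted=('zstd', 3)):
    # Recompress every chunk exactly once, independent of archives.
    # chunk_ids would come from the chunks index.
    for chunk_id in chunk_ids:
        raw = repo.get(chunk_id)                  # one read per chunk, per repo
        payload = decrypt(key, chunk_id, raw)     # authenticate + decrypt, still compressed
        if detect_compression(payload) == wanted:
            continue                              # already as wanted ("if-different")
        data = decompress(payload)
        repo.put(chunk_id, encrypt(key, chunk_id, compress(data, *wanted)))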
Another data point:
I'm recreating a 300 GB repository right now, just to change compression to auto,zstd,(some-level-that-I-forgot). Nothing is excluded or changed other than this, and I use if-different.
For one of the first archives, this took about 4 hours as it involved recompressing about 200 GB of data (some of which isn't very compressible and presumably isn't compressed at all).
Each of the subsequent archives takes around 30-50 minutes. Each such archive is about 200 GB, where some files have disappeared and others are new or changed. This means the archives are processed at around 70-110 megabytes per second. The CPU usage is constantly high.
I interpreted this as meaning that Borg is decompressing each archive as it goes (causing high CPU usage), re-chunking everything (presumably into the same chunks as before), checksumming each "new" chunk, and detecting that almost all of them are already there (from the preceding archives that are already processed). But maybe this is wrong?
IIRC, I've seen code that activates re-chunking only if chunker params are different. So, in that case it would be: read, authenticate, decrypt, decompress, rechunk, compress, encrypt, authenticate, write.
But if it does not re-chunk, that expensive processing should be only needed if stuff needs to get recompressed (minus the chunking in that case).
More thoughts about how it could be better:
- borg could create an index of what must be changed (first pass of reading from disk)
- borg could filter + recompress the chunks found in pass 1 (second read from disk)
- borg could recompress all remaining chunks as chunks (reading those chunks last)
So, only three passes, with roughly two full reads of the repo from disk; MUCH faster than processing snapshot by snapshot.
As of now, even a repo with 15 GB of data (3 GB on disk, compressed + deduplicated) takes hours to filter + recompress if it has 100 snapshots.
IIRC, I've seen code that activates re-chunking only if chunker params are different.
What I can find in archive.py is first some code that activates rechunking in ArchiveRecreater only if chunker params are set:
self.rechunkify = chunker_params is not None
if self.rechunkify:
    logger.debug('Rechunking archives to %s', chunker_params)
But even if I haven't specified any chunker params on the command line, they are still set here, apparently because the parser setup for the --chunker-params parameter of recreate uses default=CHUNKER_PARAMS. Then it would seem you can't get away from having rechunkify active here.
Of course there is another safeguard, in create_target():
target.recreate_rechunkify = self.rechunkify and source_chunker_params != target.chunker_params
This would mean that you don't rechunkify even if chunker params are set, as long as they are the same as the chunker params in the source archive, which should work fine in Borg 1.1. But I have some archives created with borg 1.0.x, and as indicated by the help for borg diff, those archives don't appear to have chunker params stored explicitly, so source_chunker_params is None. Then it seems to be impossible to avoid some rather expensive rechunking regardless of how you set target.chunker_params through the command line, which causes all files (in my current case 3 TB) to be re-read for each archive. And when you are using a remote target (SSH), it seems the files are read from the remote target to the initiating computer in order to be re-chunked there.
I tested setting default=None for --chunker-params, and then borg does not re-read all data to re-chunkify it. Now it generally takes 14 seconds per archive to filter out some unnecessary files, instead of 14 hours.
But I don't know if this has any problematic effects elsewhere.
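Putting the pieces above together as one hedged sketch (paraphrased, not copied from borg's archiver.py/archive.py):

# Stock parser setup uses default=CHUNKER_PARAMS, so chunker_params is never
# None and rechunkify gets enabled even without --chunker-params on the CLI:
self.rechunkify = chunker_params is not None

# Second safeguard in create_target(); for archives made with borg 1.0.x,
# source_chunker_params is None, so this comparison always says "different"
# and expensive rechunking happens anyway:
target.recreate_rechunkify = self.rechunkify and source_chunker_params != target.chunker_params

# The experiment: with default=None for --chunker-params, rechunkify stays
# False unless rechunking is explicitly requested.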
I've got an old attic repository I've upgraded and I'm re-chunking and recompressing now. The whole repository is about 1.5 TB on disk and contains some 50 archives, each with about 500 GB of original data. Although this by all rights should be a slow operation, I think it could be much faster.
The behaviour I'm seeing is that even when two archives are very similar (because they are two point-in-time snapshots not far apart) and result in minimal deduped output data (like 5 GB out for 100 GB in), each archive takes about 3 hours to process.
At this rate the whole repository will take something like 150 hours to recreate. You might say that's not unreasonable if the repository contains about 50*500 GB of uncompressed data and we insist on processing all of the data. We're getting an effective throughput of 47 MB/s which is respectable since there's also writing and compression activity.
Even so, it seems a small change could make it much faster.
What I'm thinking is that Borg's normal method for detecting unchanged files isn't being used. So even if Borg just read file a.dat in the previous archive and re-chunked it, it's going to extract exactly the same file in the next archive and process the whole thing.
If Borg did what it normally does it would detect a.dat is unchanged based on its path, ctime, size and inode tuple. Then it can just skip decompressing the old file and go straight to writing the output.
I think at the core borg recreate could treat each archive as a virtual filesystem and then run normal borg create on it. So all the normal caching behaviour would kick in. If a file would have been skipped according to the caching criteria when doing borg create from a real filesystem, it would be skipped also when doing it from a previous archive.
And as a bonus, if a future version of borg create is multi-threaded, borg recreate would benefit for free since it's just reusing borg create in a loop.
What I'm thinking is that Borg's normal method for detecting unchanged files isn't being used.
True, because the "normal method" (with borg create) works by comparing filesystem files against the files cache contents.
OTOH, what happens (IIRC) is this:
- the first file gets re-chunked, hashed, looked up ("do we already have the chunk with that hash?"); if not: compressed, encrypted, authenticated, and all new chunks are transmitted to and stored into the repo.
- a second (same, unchanged) file gets re-chunked, hashed, looked up, and then borg notices it already has the chunk with that hash in the repo, for all the chunks of that file.
So you see, it is not doing the whole work for each file, but the chunking and hashing always happen.
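A rough sketch of that per-file flow (simplified; chunker, key, cache and compress are stand-ins for borg internals):

def process_file_content(file_like, chunker, key, cache, repo):
    # Hedged sketch of recreate-with-rechunking for one file's content.
    new_chunks = []
    for data in chunker.chunkify(file_like):   # chunking happens for every file
        chunk_id = key.id_hash(data)           # hashing happens for every file
        if not cache.seen_chunk(chunk_id):     # dedup lookup against the chunks index
            repo.put(chunk_id, key.encrypt(compress(data)))  # only new chunks are stored
        new_chunks.append(chunk_id)
    return new_chunks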
So you see, it is not doing the whole work for each file, but the chunking and hashing always happen.
Right, that's the point I'm trying to get at. Because every file is chunked and hashed every time, this is different from and slower than borg create. borg create has its files cache, so when it sees the same path, ctime, size and inode a second time, it can just look up the chunks without reading the file. Why shouldn't borg recreate have a cache too?
Come to think of it, it can be even simpler than what I suggested. Use a content hash to cache re-chunking results.
Let borg recreate have its own cache while working. It stores hash(list of old chunk hashes) -> list of new chunk hashes pairs. This cache basically answers the question, "When I had a file with chunks X and I re-chunked it, what new list of chunks did I get?"
Now when it sees any file, it looks up its list of chunks from the item meta data. It hashes that list and looks for it in the cache. If there is a match, it means that recreate already re-chunked exactly that file's content before (because if the contents were different it'd have different chunks). We skip the re-chunk and use the cached value.
So in pseudo-code:
item = get_item(a_file)
content_hash = hash("".join(chunk_hash for chunk_hash in item.chunks))
new_chunk_list = recreate_cache.get(content_hash)
if not new_chunk_list:
    new_chunk_list = chunkify(a_file)
    recreate_cache[content_hash] = new_chunk_list
new_item = copy(item)
new_item.chunks = new_chunk_list
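A slightly more concrete, runnable version of that sketch (recreate_cache, read_and_chunkify and id_hash are placeholder names, not borg API):

import hashlib

recreate_cache = {}  # hash(old chunk id list) -> new chunk id list

def rechunk_item(old_chunk_ids, read_and_chunkify, id_hash):
    # Re-chunk a file's content at most once per distinct content,
    # identified by the hash over its old chunk ids.
    content_hash = hashlib.sha256(b''.join(old_chunk_ids)).digest()
    new_chunk_ids = recreate_cache.get(content_hash)
    if new_chunk_ids is None:
        # expensive path: extract the file data and re-chunk it
        new_chunk_ids = [id_hash(data) for data in read_and_chunkify()]
        recreate_cache[content_hash] = new_chunk_ids
    return new_chunk_ids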
Here's an example: let's say we have a borg repository which has 100 archives. Each is a snapshot of a filesystem containing just 1 file, cats.txt, and the file never changes. The file compresses and dedups really well, so it's 1 GB original data but just 1 MB on disk. Obviously not realistic but bear with me.
Old method (based on my imagination of how borg recreate probably works today):
- For the first archive, (1 time)
  - Extract cats.txt and re-chunk it, (extract 1 GB of data and re-chunk it, 1 time)
  - Write any chunks missing, (write 1 MB, 1 time)
  - Write file metadata, (1 time)
- For each remaining archive, (99 times)
  - Extract cats.txt and re-chunk, (extract 1 GB of data and re-chunk it, 99 times)
  - Write any chunks missing, (write nothing, 0 times)
  - Add the file to the new archive (99 times)
New method:
- For the first archive, (1 time)
  - Extract cats.txt. Its content hash is not in the cache, so re-chunk it, (extract 1 GB of data and re-chunk it, 1 time)
  - Write any chunks missing, (write 1 MB, 1 time)
  - Write file metadata, (1 time)
- For each remaining archive, (99 times)
  - Get the new chunk list from the cache, (extract nothing, 0 times)
  - Write any chunks missing, (write nothing, 0 times)
  - Add the file to the new archive (99 times)
So the old method has to re-chunk 100 GB of data to output 1 MB, while the cache version re-chunks 1 GB to output 1 MB.
This method doesn't help with files that partially change between each snapshot because the list of chunks would change. But for many kinds of "backup the whole computer every day" scenarios, I bet 99% of all files are identical. Borg recreate would be massively faster in those cases at the cost of just a small hash table.
One nice thing with this method is that unlike the "path, ctime, size and inode tuple" cache borg create uses, which in theory can be "fooled", this method can't be. The key we use to see if two files differ is based on the actual content, so the only way the key matches is if the content matches (assuming a cryptographic collision resistant hash). We have this luxury because we have the list of chunk hashes in existing archives, unlike what borg create has to work with.
Also note that if for some reason two different files have the same hashed content, the cache would also prevent those files from being re-chunked twice. So e.g. if cats.txt gets renamed to dogs.txt it still doesn't need to be rechunked because its content hash remains the same.
This cache would only be used when re-chunking, so it's a little niche. But re-chunk right now is slow.
I'm running this on a fast SSD with only 50 archives and it's taking 150 hours. For folks out there with magnetic drives and a similar number of archives, I imagine it might take something like 1500 hours to do the same? At that point it's so slow that it's practically impossible to re-chunk a repository like that. But if you don't re-chunk, you lose some dedup benefits after the borg upgrade.
@aljungberg That sounds good! Do you want to make a PR against master for that?
For the first approach, one could even just map tuple(old_chunks) -> tuple(new_chunks). Needs a bit more memory, but can be easily optimized later.
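That variant could look like this (a hypothetical sketch, trading a bit of memory for simplicity):

rechunk_cache = {}  # tuple(old chunk ids) -> list of new chunk ids

def rechunk_with_cache(old_chunk_ids, read_and_chunkify, id_hash):
    key = tuple(old_chunk_ids)
    if key not in rechunk_cache:
        rechunk_cache[key] = [id_hash(data) for data in read_and_chunkify()]
    return rechunk_cache[key]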
Of course such a feature might need quite a bit of memory, but it's not significantly different from what the files cache needs now anyway (maybe one should make sure the files cache is not loaded for recreate, I hope this is already the case).
For recreate (rechunk) this obviously should make it quite a bit faster.
If we do not rechunk, but just recompress I guess it won't make it faster because in that case the chunk ids list stays the same (the id is computed from uncompressed unencrypted contents), just the compression changes.
I'd love to make a PR but as time passes it seems less and less likely I'll find the time! I'm still keeping this in mind but if anyone else wants to take a stab they should.
Note: in master branch, I added "flags" to the repo index, so we could flag all chunks in the repo "needs to get recompressed". Then it could go over the chunks, recompress and unflag them. Commit once in a while.
Even if interrupted, the flags in the repo index would "know" what's left to do. If all flags are cleared again, we're done.
This would be completely independent of archives and just re-compress each chunk only once.
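As a sketch of how that could work (set_flag, clear_flag and iter_flagged are hypothetical names, not the actual master-branch API):

NEEDS_RECOMPRESS = 1  # hypothetical flag bit in the repo index

def recompress_with_flags(repo_index, repo, key, recompress_chunk, commit_every=10000):
    # Flag all chunks once, then recompress and unflag them.
    # Interruptible: the remaining flags record what is left to do.
    for chunk_id in repo_index:
        repo_index.set_flag(chunk_id, NEEDS_RECOMPRESS)
    for n, chunk_id in enumerate(repo_index.iter_flagged(NEEDS_RECOMPRESS)):
        repo.put(chunk_id, recompress_chunk(key, chunk_id, repo.get(chunk_id)))
        repo_index.clear_flag(chunk_id, NEEDS_RECOMPRESS)
        if n % commit_every == 0:
            repo.commit()  # commit once in a while
    repo.commit()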
Also: in master branch, there is no csize any more in the chunks index, nor in the archives - this makes recompressing stuff much easier as there is no need to update all that.
#7037 is doing repo-wide recompression.
I didn't even need the flags because the ctype/clevel is in the now separately available metadata.