detect renamed files and avoid file transfer
This is a copy of the old Bugzilla issue: https://bugzilla.samba.org/show_bug.cgi?id=2294
This would certainly be a big win in many cases. It is complicated by the incremental way the hashes are calculated (we don't hash the full file list before starting transfers).
Note that --fuzzy partially handles this issue; its key limitation is that it only looks in the same directory. Extending it to look across the whole destination tree, perhaps sorted by file size for faster matching, would make it more useful.
I would like to propose a --fuzzy2 option in rsync, which also considers the entire tree.
I have a prototype written in awk, which I currently run before the actual rsync run.
It's not perfect and could probably be done better, but it works and saves me a lot of transfer when I rename files or folders. It only moves files, not folders. The awk delta calculation for 10K files in the source against 10K files in the target takes about 0.3 s.
The required folder tree must be created in the target first, e.g. with rsync :-)
rsync -a --include='*/' --exclude='*' "${sourcepath}" "${targetpath}"
awk: the source and target file information must each be loaded into an array:
array format:
Files_last_modification_time _ filesize filepath
e.g.
1718541359.8524070000t-147 /home/claus/.bashrc
1717861293.8939940000t-57 /home/claus/.bash_profile
The first column is the key-id, here date + size. The second column is the file path. When using --checksum, the date+size key must be replaced by a hash.
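One way such a listing could be produced is sketched below. It is only an illustration, not the prototype's actual code: `list_tree` is a hypothetical helper name, it relies on GNU find's `-printf`, it uses `_` as the separator from the format line above, and it prints paths relative to the given root so that source and target listings are directly comparable.

```shell
# list_tree DIR
# Print one line per regular file under DIR: "<mtime>_<size> <relative path>".
# Assumes GNU find (-printf is not available in BSD find).
list_tree() {
    # %T@ = modification time in seconds since the epoch (with fraction)
    # %s  = file size in bytes
    # %P  = path relative to the starting directory
    find "$1" -type f -printf '%T@_%s %P\n'
}
```

Running it once on the source and once on the target (for example `list_tree "$sourcepath" > src.list`) gives the two inputs for the awk step.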
populate the array:
aa source array
bb target array
awk main loop
(x == key-id)
( aa[x] == file path source)
( bb[x] == file path target)
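# targetpath: root of the target tree, assumed to be passed in (e.g. awk -v targetpath=...)
for (x in aa) {
  if (x in bb) {
    # same key under a different path: the file was moved or renamed
    if (aa[x] != bb[x]) { print "mv --no-clobber \"" targetpath bb[x] "\" \"" targetpath aa[x] "\"" }
    delete bb[x];
  }
}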
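For reference, the loop above can be wrapped into a self-contained sketch that also populates `aa` and `bb` from two listing files. This is an assumption-laden illustration rather than the original prototype: `plan_renames` is a hypothetical name, each input line is expected to be `<key> <relative path>`, keys that occur more than once on either side are skipped as ambiguous, and the output is a tab-separated `old<TAB>new` pair per rename instead of a ready-made mv command, which sidesteps shell quoting. Paths containing tabs or newlines are not handled.

```shell
# plan_renames SRC_LIST DST_LIST
# Prints "old_path<TAB>new_path" for every file that has the same key in
# both listings but a different path.
plan_renames() {
    awk '
        # Split each line at the first space so paths may contain spaces.
        function key(line)  { return substr(line, 1, index(line, " ") - 1) }
        function path(line) { return substr(line, index(line, " ") + 1) }

        # First file: source listing -> aa. Second file: target listing -> bb.
        FILENAME == ARGV[1] { k = key($0); aa[k] = path($0); na[k]++; next }
                            { k = key($0); bb[k] = path($0); nb[k]++ }

        END {
            for (x in aa) {
                if (!(x in bb)) continue                 # new file, rsync transfers it
                if (na[x] > 1 || nb[x] > 1) continue     # ambiguous key, leave it to rsync
                if (aa[x] != bb[x]) printf "%s\t%s\n", bb[x], aa[x]
            }
        }
    ' "$1" "$2"
}
```

The plan can be reviewed first and then applied with a loop such as `plan_renames src.list dst.list | while IFS="$(printf '\t')" read -r old new; do mv --no-clobber -- "$targetpath/$old" "$targetpath/$new"; done`.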
After reviewing and executing the proposed mv commands, I run the real rsync, which cleans up remaining things.
Here are a few additional details I found with a quick search, partially for my own reference.
--fuzzy
It is apparently recommended to use --fuzzy --delay-updates --delete-delay together.
Patches
There are a couple of patches available (they have existed for a long time):
https://github.com/RsyncProject/rsync-patches/blob/master/detect-renamed.diff
https://github.com/RsyncProject/rsync-patches/blob/master/detect-renamed-lax.diff
As of 2021, there were no plans to merge them.
I have not tried them yet. I am not sure whether they patch cleanly, whether it'd be enough to apply the patches locally, or whether rsync would have to be patched on both ends for this to work when syncing to/from a remote.
Tools
There are a few tools for this available, but I didn't have a lot of luck with them.
https://github.com/m-manu/rsync-sidekick works for local transfers. See https://github.com/m-manu/rsync-sidekick for some more links to various tools that do remote transfers.
I tried https://github.com/gbabin/rsync-prelude; it was quite slow for me when working on many relatively slow files, since it calls ssh md5sum for each file separately instead of computing the checksums with fewer commands.
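To illustrate what "fewer commands" could look like, here is a minimal sketch that checksums a whole tree with a single find invocation, which batches many files into each md5sum call. `checksum_tree` is a hypothetical helper, not part of any of the tools above, and the remote form in the comment uses a made-up host and path.

```shell
# checksum_tree DIR
# Print "<md5>  ./<relative path>" for every regular file under DIR.
# "-exec ... {} +" passes many files to each md5sum process instead of
# starting one process per file.
checksum_tree() {
    ( cd "$1" && find . -type f -exec md5sum {} + )
}

# The same pipeline can run remotely over one ssh connection, e.g.
# (user@host and /data are placeholders):
#   ssh user@host 'cd /data && find . -type f -exec md5sum {} +'
```

With this approach the cost is one connection per tree rather than one per file; the checksum work itself is unchanged.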
BTW, naively, it would seem that a principled way to solve this would be if the rsync algorithm worked as well for a directory of files as it would for an uncompressed tar archive of those files, where renaming of files inside the archive should not cause rsync any trouble. In other words, it'd be cool if rsync could deduplicate the transfer of identical data blocks no matter what file they are from. This would work better than whole-file hashing solutions, since it should work for files that are renamed and then slightly modified.
I am not sure how well the rsync algorithm would work in practice for such a huge file.
I think restic/borg do this kind of deduplication, but I do not know whether they need to do something beyond the rsync algorithm to achieve it.
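The idea of deduplicating identical blocks regardless of which file they belong to can be made concrete with a rough sketch. It is a deliberate simplification: it hashes blocks at fixed offsets, so unlike rsync's rolling checksum it would not recognise data that has shifted after an insertion. `block_hashes` is a hypothetical helper and assumes GNU coreutils (`split --filter`).

```shell
# block_hashes DIR [BLOCK_SIZE]
# Print one md5 per fixed-size block of every regular file under DIR.
# Blocks are cut at fixed offsets (default 4096 bytes); this is a
# simplification, not the rsync algorithm.
block_hashes() {
    find "$1" -type f \
        -exec split -b "${2:-4096}" --filter='md5sum' {} \; | cut -d' ' -f1
}
```

Comparing `block_hashes dir | wc -l` with `block_hashes dir | sort -u | wc -l` shows how many blocks are duplicates across the tree; a renamed copy of a file contributes no new unique blocks.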
In theory, instead of rsync-ing a large dir from X to Y, one could create a restic backup snapshot of the dir on Y (with the backup also stored on Y), then back up the dir from X to the same repository on Y. This should only transfer the data differences between the dir's state on X and Y, regardless of renames. I wonder whether or not this backup would be faster than the rsync transfer of the same data if many files are renamed. If rsync is currently slower than that abuse of restic, it'd be cool for rsync to learn the optimizations necessary to be as fast.