Circular symlinks may cause sync to spin (almost) forever
Filed so that it is tracked separately from #390 / #512
@zackse will you try to fix this one? It's a serious "help wanted" task, not a tiny "up-for-grabs" change.
It all comes down to available cycles... Can't commit just yet, but I'm game for discussion. A few thoughts:
If we don't have a convention for storing symlink metadata in the cloud, does it make sense to ever transfer symlinks? Perhaps -- it just means there are multiple copies of files/directories stored and a restore is not an exact replica of the source data.
For directories, if we (naively) transfer only the first occurrence of an inode, but we don't have a way of storing symlink metadata, the backup will skew even further from the source data, because we will have multiple copies of some files and directories, but not others (where a cycle was detected).
Is it sufficient to skip symlinks that point to the current directory or a parent directory? I think so, but need to think a bit more.
A circular symlink can also point to the parent of the parent of the parent (...) of the current directory. As far as I know, we have to check all the way up to / to be sure it's not circular.
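As a sketch of that check (the helper name `points_to_ancestor` is made up for illustration, not part of the CLI), we can resolve the link target and then walk from the link's own directory all the way up to the root:

```python
import os

def points_to_ancestor(link_path):
    """True if the symlink at link_path resolves to the directory containing
    it, or to any ancestor of that directory, checked all the way up to /."""
    target = os.path.realpath(link_path)
    directory = os.path.realpath(os.path.dirname(os.path.abspath(link_path)))
    while True:
        if target == directory:
            return True
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the filesystem root
            return False
        directory = parent
```

A link to `/` or to `..` is caught by this; a link to a sibling directory is not.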
B2 CLI's sync is effectively a simple backup application. While the cloud doesn't really enforce or provide guidance for storing symlink metadata, I think we could consider inventing something... But as of now, nobody reported a CLI issue about not being able to back up symlinks, so I'd skip that and just keep the CLI from spinning forever on circular symlinks.
Yes, good point -- I should have written "ancestor" where I wrote "parent".
I think it is relatively easy to check for direct ancestors to the sync source directory, or loops inside it.
Another interesting case is if there is a symlink from the sync dir (startdir below) to somewhere outside the tree that is not an ancestor, and in that outside tree, there is a symlink back to somewhere in the sync dir.
```
zackse@goyeto:~/backup-test# find -ls | cut -c16-
drwxr-xr-x 6 zackse staff 204 Oct 3 11:37 .
drwxr-xr-x 4 zackse staff 136 Oct 3 11:00 ./outsidedir
drwxr-xr-x 3 zackse staff 102 Oct 3 10:57 ./outsidedir/four
drwxr-xr-x 4 zackse staff 136 Oct 3 10:58 ./outsidedir/four/five
lrwxr-xr-x 1 zackse staff 21 Oct 3 10:58 ./outsidedir/four/five/one -> ../../../startdir/one
drwxr-xr-x 2 zackse staff 68 Oct 3 10:57 ./outsidedir/four/five/six
-rw-r--r-- 1 zackse staff 6 Oct 3 10:58 ./outsidedir/hello.txt
drwxr-xr-x 6 zackse staff 204 Oct 3 11:11 ./startdir
lrwxr-xr-x 1 zackse staff 1 Oct 3 11:11 ./startdir/badlink -> /
-rw-r--r-- 1 zackse staff 6 Oct 3 10:58 ./startdir/hello.txt
drwxr-xr-x 4 zackse staff 136 Oct 3 10:58 ./startdir/one
-rw-r--r-- 1 zackse staff 8 Oct 3 10:58 ./startdir/one/goodbye.txt
drwxr-xr-x 3 zackse staff 102 Oct 3 10:57 ./startdir/one/two
drwxr-xr-x 2 zackse staff 68 Oct 3 10:57 ./startdir/one/two/three
lrwxr-xr-x 1 zackse staff 13 Oct 3 11:00 ./startdir/outsidedir -> ../outsidedir
```
If we attempt to sync startdir, startdir/badlink should be skipped, since it points to an ancestor. The link startdir/outsidedir is OK, however, until we crawl its target and reach outsidedir/four/five/one, which points back into startdir.
We could just skip it, since it points back into the original source directory, but there isn't actually a cycle here. What do you think?
This scenario is a nice one; it should be the basis for our unit tests.
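A sketch of a test fixture that recreates the tree from the listing above (the helper name `build_symlink_fixture` is made up; file contents are arbitrary):

```python
import os

def build_symlink_fixture(root):
    """Recreate the backup-test tree from the listing above under root."""
    os.makedirs(os.path.join(root, 'outsidedir', 'four', 'five', 'six'))
    os.makedirs(os.path.join(root, 'startdir', 'one', 'two', 'three'))
    with open(os.path.join(root, 'outsidedir', 'hello.txt'), 'w') as f:
        f.write('hello\n')
    with open(os.path.join(root, 'startdir', 'hello.txt'), 'w') as f:
        f.write('hello\n')
    with open(os.path.join(root, 'startdir', 'one', 'goodbye.txt'), 'w') as f:
        f.write('goodbye\n')
    # The three interesting links: one to an ancestor (/), one leaving the
    # sync dir, and one pointing from outside back into the sync dir.
    os.symlink('/', os.path.join(root, 'startdir', 'badlink'))
    os.symlink('../outsidedir', os.path.join(root, 'startdir', 'outsidedir'))
    os.symlink('../../../startdir/one',
               os.path.join(root, 'outsidedir', 'four', 'five', 'one'))
```

Each test can then build this tree in a temporary directory and sync `startdir`.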
What would a user of a backup utility which doesn't have symlink metadata backup/restore capability expect?
I think the symlinks are expected to be read and written as if they were ordinary files (or hardlinks), with circular symlinks being an exception. Therefore ./startdir/outsidedir/four/five/one/goodbye.txt would get synced in pretty much the same way in which find would list it. Actually, I looked up man find and found some interesting notes about loops there; you might want to give it a read too.
@ppolewicz I am thinking about a reasonable approach here. It seems that the find command limits the number of symlink levels allowed while traversing the directory tree. I haven't researched it deeply yet, but I think the default limit is 8 and it can be raised by the user up to some maximum value. What do you think about this approach?
I don't think we need to limit the depth of symlink redirects, but triggering the infinite-loop detection mechanism only after we have followed a symlink more than N times is a great idea!
Yes, it seems that a simple limit is an easy and efficient solution here. I am just worried that if we only want to detect true loops, we will need to keep track of how many times each symlink was visited while traversing the tree, and that memory grows with the number of symlinks present in the directory we are syncing. With many, many symlinks we may have issues there. Limiting the symlink levels, regardless of whether the links are the same or not, lets us implement this with O(1) memory. At least that is my current thinking...
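A rough sketch of that constant-memory variant. The names `walk_limited` and `MAX_SYMLINK_DEPTH` are illustrative, and the limit of 8 just mirrors the value mentioned above (I haven't verified it is find's actual default); only a counter is carried down each branch, with no per-symlink bookkeeping:

```python
import os

MAX_SYMLINK_DEPTH = 8  # cap on link jumps along any one branch (assumed value)

def walk_limited(path, symlink_depth=0):
    """Yield file paths under path, following symlinked directories but
    abandoning any branch that has passed through more than
    MAX_SYMLINK_DEPTH symlinks. Guarantees termination even with cycles."""
    for name in sorted(os.listdir(path)):
        entry = os.path.join(path, name)
        is_link = os.path.islink(entry)
        depth = symlink_depth + 1 if is_link else symlink_depth
        if is_link and depth > MAX_SYMLINK_DEPTH:
            continue  # too many link jumps on this branch; give up on it
        if os.path.isdir(entry):
            yield from walk_limited(entry, depth)
        else:
            yield entry
```

The trade-off is exactly the one described above: a circular symlink is traversed up to the limit (so files inside the loop are listed several times), and legitimate chains deeper than the limit are cut off.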
We only need to check for a loop when we are processing a symlink, and then we just need to check whether the path that the symlink points to has already been seen in the set of path chunks on the path of the currently processed file. In other words, see whether, after jumping to the target of the link, we won't find ourselves in a place we've already been. The memory cost should be O(N), where N is the number of path chunks leading to the place where we are right now. As an optimization, we might not run any detection at all until we exceed some threshold (a number of link jumps, path length, or something like that).
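A sketch of this idea (names are illustrative, and I'm tracking resolved directory paths rather than raw path chunks): keep the set of resolved directories on the current traversal path and refuse to descend when a target is already in it:

```python
import os

def walk_no_cycles(path, seen=None):
    """Yield files under path, following symlinks but refusing to enter any
    directory whose resolved path is already on the current traversal path
    (i.e. a cycle). Memory is O(depth of the current path)."""
    if seen is None:
        seen = set()
    real = os.path.realpath(path)
    if real in seen:
        return  # we've already been here on this path: a loop
    seen.add(real)
    try:
        for name in sorted(os.listdir(path)):
            entry = os.path.join(path, name)
            if os.path.isdir(entry):
                yield from walk_no_cycles(entry, seen)
            else:
                yield entry
    finally:
        seen.discard(real)  # leaving this directory; a different branch may
                            # legitimately visit it again later
```

Discarding the entry on the way out is what makes the out-and-back-in case above work: startdir/one can still be reached a second time via outsidedir, since only true cycles on the *current* path are rejected.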
@ppolewicz I think this issue can be closed now.