B2_Command_Line_Tool icon indicating copy to clipboard operation
B2_Command_Line_Tool copied to clipboard

Circular symlinks may cause sync to spin (almost) forever

Open ppolewicz opened this issue 7 years ago • 5 comments

Filed so that it is tracked separately from #390 / #512

ppolewicz avatar Sep 25 '18 00:09 ppolewicz

@zackse will you try to fix this one? It's a serious "help wanted" task, not a tiny "up-for-grabs" change.

ppolewicz avatar Sep 30 '18 23:09 ppolewicz

It all comes down to available cycles... Can't commit just yet, but I'm game for discussion. A few thoughts:

If we don't have a convention for storing symlink metadata in the cloud, does it make sense to ever transfer symlinks? Perhaps -- it just means there are multiple copies of files/directories stored and a restore is not an exact replica of the source data.

For directories, if we (naively) transfer only the first occurrence of an inode, but we don't have a way of storing symlink metadata, the backup will skew even further from the source data, because we will have multiple copies of some files and directories, but not others (where a cycle was detected).

Is it sufficient to skip symlinks that point to the current directory or a parent directory? I think so, but need to think a bit more.

zackse avatar Oct 02 '18 05:10 zackse

A circular symlink can also point to parent of the parent of the parent (...) of the current directory. As far as I know, we have to check all the way to / to be sure it's not circular.

B2 CLI's sync is effectively a simple backup application. While the cloud doesn't really enforce or provide guidance for storing symlink metadata, I think we could consider inventing something... But as of now, nobody reported a CLI issue about not being able to back up symlinks, so I'd skip that and just keep the CLI from spinning forever on circular symlinks.

ppolewicz avatar Oct 03 '18 01:10 ppolewicz

Yes, good point -- I should have written "ancestor" where I wrote "parent".

I think it is relatively easy to check for direct ancestors to the sync source directory, or loops inside it.

Another interesting case is if there is a symlink from the sync dir (startdir below) to somewhere outside the tree that is not an ancestor, and in that outside tree, there is a symlink back to somewhere in the sync dir.

zackse@goyeto:~/backup-test# find -ls | cut -c16-
drwxr-xr-x   6 zackse   staff         204 Oct  3 11:37 .
drwxr-xr-x   4 zackse   staff         136 Oct  3 11:00 ./outsidedir
drwxr-xr-x   3 zackse   staff         102 Oct  3 10:57 ./outsidedir/four
drwxr-xr-x   4 zackse   staff         136 Oct  3 10:58 ./outsidedir/four/five
lrwxr-xr-x   1 zackse   staff          21 Oct  3 10:58 ./outsidedir/four/five/one -> ../../../startdir/one
drwxr-xr-x   2 zackse   staff          68 Oct  3 10:57 ./outsidedir/four/five/six
-rw-r--r--   1 zackse   staff           6 Oct  3 10:58 ./outsidedir/hello.txt
drwxr-xr-x   6 zackse   staff         204 Oct  3 11:11 ./startdir
lrwxr-xr-x   1 zackse   staff           1 Oct  3 11:11 ./startdir/badlink -> /
-rw-r--r--   1 zackse   staff           6 Oct  3 10:58 ./startdir/hello.txt
drwxr-xr-x   4 zackse   staff         136 Oct  3 10:58 ./startdir/one
-rw-r--r--   1 zackse   staff           8 Oct  3 10:58 ./startdir/one/goodbye.txt
drwxr-xr-x   3 zackse   staff         102 Oct  3 10:57 ./startdir/one/two
drwxr-xr-x   2 zackse   staff          68 Oct  3 10:57 ./startdir/one/two/three
lrwxr-xr-x   1 zackse   staff          13 Oct  3 11:00 ./startdir/outsidedir -> ../outsidedir

If we attempt to sync startdir, startdir/badlink should be skipped, since it points to an ancestor. The link startdir/outsidedir is OK, however, until we crawl its target and reach outsidedir/four/five/one, which points back into startdir.

We could just skip it, since it points back into the original source directory, but there isn't actually a cycle here. What do you think?

zackse avatar Oct 03 '18 18:10 zackse

This situation is pretty nice, it should be a base for our unit tests.

What would a user of a backup utility which doesn't have symlink metadata backup/restore capability expect? I think the symlinks are expected to be read and written as they were ordinary files (or hardlinks), with circular symlinks being an exception. Therefore ./startdir/outsidedir/four/five/one/goodbye.txt will get synced in pretty much the same way in which find would list it. Actually, I looked up man find and found some interesting notes about loops there, you might want to give it a read too.

ppolewicz avatar Oct 03 '18 22:10 ppolewicz

@ppolewicz I am thinking about a reasonable approach here. It seems that the find command limits the number of symlink levels allowed while traversing the directory tree. I haven't yet researched in deeply but I think the default limit is 8 and it can be modified by user up to some max value. What do you think about this approach?

przadka avatar Apr 08 '23 07:04 przadka

I don't think we need to limit the depth of symlink redirects, but only triggering the infinite loop detection mechanism if we have followed a symlink more than N times is a great idea!

ppolewicz avatar Apr 08 '23 15:04 ppolewicz

Yes, it seems that a simple limit is an easy and efficient solution here. I am just worried that if we indeed only want to detect loops, then we will need to keep track of how many times each symlink was visited while traversing the tree. And the memory needed grows with the number of symlinks present in the directory we are syncing. So with many, many symlinks we may have issues here. Limiting the symlink levels regardless of whether they are the same or not, allows us to implement this with O(1) memory required. At least that is my current thinking...

przadka avatar Apr 08 '23 16:04 przadka

We only need to check if we are in the loop when we are processing a symlink and then we just need to check if that path that the symlink points to has been seen already in the set of patch chunks that's on the path of the currently processed file. In other words, see if after jumping to the target of the link we won't find ourselves in a place we've already been at. The memory cost should be that of O(N) where N is the number of path chunks leading to the place where we are right now + an optimization might be to not try to run any detection unless we exceed some threshold (a number of link jumps or path length or something like that).

ppolewicz avatar Apr 08 '23 21:04 ppolewicz

@ppolewicz I think this issue can be closed now.

przadka avatar May 25 '23 12:05 przadka