stage add: StageExternalOutputsError on symlinked cache
Bug Report
Description
os.path.realpath resolves a symlink here: https://github.com/iterative/dvc/blob/82c5caee27d4b5591d4ab0b07fd1a73064ba8bff/dvc/output.py#L393 so if you want to add a stage where the output is already present but is cached it will think that it is external.
Reproduce
- dvc init
- dvc config cache.dir /some/external/dir
- dvc config cache.type symlink
- run and cache something
- dvc stage add /w --out something that has already been cached
Expected
Outputout.is_in_repo should not be False for a symlink to the cache in the repo
Environment information
$ dvc doctor
DVC version: 2.9.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-41-generic-x86_64-with-glibc2.29
Supports:
hdfs (fsspec = 2021.10.1, pyarrow = 4.0.0),
webhdfs (fsspec = 2021.10.1),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
s3 (s3fs = 2021.10.1, boto3 = 1.17.106),
ssh (sshfs = 2021.8.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: local, local
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
@endremborza good catch! Does it break some use case? How did you stumble upon it?
I am working on a tool that manages pipelines that use dvc and it updates stages in dvc pipelines based on parts of the code. currently, this breaks with symlinked cache.
This is also reproducible in dvc add and it affects any symlink that points outside the repo directory (whether or not it's linked to the cache dir should be irrelevant in this scenario).
test-stage-add git:master ✓ ✩ py:dvc ❯ la
total 8.0K
drwxr-xr-x 6 pmrowla staff 192 Dec 15 11:01 .dvc
-rw-r--r-- 1 pmrowla staff 139 Dec 15 10:51 .dvcignore
drwxr-xr-x 11 pmrowla staff 352 Dec 15 11:09 .git
-rw-r--r-- 1 pmrowla staff 5 Dec 15 10:54 .gitignore
lrwxr-xr-x 1 pmrowla staff 6 Dec 15 11:02 foo -> ../foo
test-stage-add git:master ✓ ✩ py:dvc ❯ dvc add -v foo
2021-12-15 11:09:21,483 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/config.local' to gitignore file.
2021-12-15 11:09:21,490 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/tmp' to gitignore file.
2021-12-15 11:09:21,490 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/cache' to gitignore file.
2021-12-15 11:09:21,505 ERROR: Output(s) outside of DVC project: foo. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
------------------------------------------------------------
Traceback (most recent call last):
File "/Users/pmrowla/git/dvc/dvc/command/add.py", line 21, in run
self.repo.add(
File "/Users/pmrowla/git/dvc/dvc/utils/collections.py", line 163, in inner
result = func(*ba.args, **ba.kwargs)
File "/Users/pmrowla/git/dvc/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/Users/pmrowla/git/dvc/dvc/repo/scm_context.py", line 152, in run
return method(repo, *args, **kw)
File "/Users/pmrowla/git/dvc/dvc/repo/add.py", line 171, in add
stages = list(ui.progress(stages_it, desc=desc, unit="file"))
File "/Users/pmrowla/.virtualenvs/dvc/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
for obj in iterable:
File "/Users/pmrowla/git/dvc/dvc/repo/add.py", line 259, in create_stages
stage = repo.stage.create(
File "/Users/pmrowla/git/dvc/dvc/repo/stage.py", line 181, in create
stage = create_stage(
File "/Users/pmrowla/git/dvc/dvc/stage/__init__.py", line 88, in create_stage
check_no_externals(stage)
File "/Users/pmrowla/git/dvc/dvc/stage/utils.py", line 136, in check_no_externals
raise StageExternalOutputsError(
dvc.stage.exceptions.StageExternalOutputsError: Output(s) outside of DVC project: foo. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
------------------------------------------------------------
This breaks the documented use case where you can add a symlinked file to make DVC copy that file (from the outside-of-repo symlinked location) into the cache.
This use case has been superseded by add --to-cache, but if we intended to drop/deprecate the "add symlinked file" behavior we need to document it properly (and account for the specific case where the symlink points to the DVC cache directory)
I think this issue probably needs more investigation - adding symlinks this way definitely worked properly in past versions of DVC, and we should figure out what/when actually broke this behavior. Depending on that, the fix in https://github.com/iterative/dvc/pull/7119 may or may not be sufficient?
Also, it's clear we should really have (add and stage add) test cases for this.
should also note that I did a quick check to see if this was related to the path_info -> string paths change, but it looks like this was already broken 4175f9ffa4948157436283bef11153a7233e90ca (before we started the drop path_info changes) @efiop
That's fair. Currently unfortunately I can't really dedicate too much time to this, but the fix in #7119 solved my case completely.
I started looking into it a little better as I made time for it.
There are 4 symlink related tests in func/test_add.py (which breaks a bunch of pylint rules btw and can not go through pre-commit) I added one in #7119 that basically covers the use case above. This breaks without the suggested change, and passes with it.
However, I'm not sure about the policy for handling this. Few possibilities I see:
- As suggested above, the whole thing can be dropped/deprecated. this requires a workaround for my original issue of a symlinked cache and a bit of documentation.
- A warning can be sent to the user explaining if copying of external symlinks is happening (which is the default behavior)
- Behavior can be dependent on cache settings: e.g. raise error on type: copy and ignore on type: symlink
I personally think that going the current way of the PR is fine, as those who deliberately use symlinks in a repo probably know what they are doing, but I don't have much insight on the user base. If a firm decision can be made quickly, I am open to implementing it, as I am quite eager for this to work in my cases without relying on my fork.
Looks like this was introduced in https://github.com/iterative/dvc/pull/5954.
Likely will be resolved as a part of https://github.com/iterative/dvc/issues/7384
#7581 closed #7384, but as noted in https://github.com/iterative/dvc/issues/7601#issuecomment-1103531163 😅 , this is still an issue. Any plan on how/when this is expected to be fixed?
Closing in favor of #7668
Not addressed by #7668. See https://github.com/iterative/dvc/issues/7668#issuecomment-1173062217.
I just needed to upgrade my dvc, but this is still broken, so I need to make the change shown in #7119 by hand every time I want to use a global symlink cache. can it really not be just added? as it seems to break nothing
This was fixed by https://github.com/iterative/dvc/pull/9626
thank you!
@endremborza Thank you for the feedback, let us know if you run into this problem again. And sorry for such a long turnaround time 🙁