dvc icon indicating copy to clipboard operation
dvc copied to clipboard

stage add: StageExternalOutputsError on symlinked cache

Open endremborza opened this issue 4 years ago • 11 comments

Bug Report

Description

os.path.realpath resolves a symlink here: https://github.com/iterative/dvc/blob/82c5caee27d4b5591d4ab0b07fd1a73064ba8bff/dvc/output.py#L393 so if you want to add a stage where the output is already present but is cached it will think that it is external.

Reproduce

  1. dvc init
  2. dvc config cache.dir /some/external/dir
  3. dvc config cache.type symlink
  4. run and cache something
  5. dvc stage add /w --out something that has already been cached

Expected

Outputout.is_in_repo should not be False for a symlink to the cache in the repo

Environment information

$ dvc doctor
DVC version: 2.9.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.11.0-41-generic-x86_64-with-glibc2.29
Supports:
        hdfs (fsspec = 2021.10.1, pyarrow = 4.0.0),
        webhdfs (fsspec = 2021.10.1),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2021.10.1, boto3 = 1.17.106),
        ssh (sshfs = 2021.8.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: local, local
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git

endremborza avatar Dec 09 '21 17:12 endremborza

@endremborza good catch! Does it break some use case? How did you stumble upon it?

pared avatar Dec 13 '21 18:12 pared

I am working on a tool that manages pipelines that use dvc and it updates stages in dvc pipelines based on parts of the code. currently, this breaks with symlinked cache.

endremborza avatar Dec 14 '21 09:12 endremborza

This is also reproducible in dvc add and it affects any symlink that points outside the repo directory (whether or not it's linked to the cache dir should be irrelevant in this scenario).

test-stage-add git:master ✓ ✩  py:dvc ❯ la
total 8.0K
drwxr-xr-x  6 pmrowla staff 192 Dec 15 11:01 .dvc
-rw-r--r--  1 pmrowla staff 139 Dec 15 10:51 .dvcignore
drwxr-xr-x 11 pmrowla staff 352 Dec 15 11:09 .git
-rw-r--r--  1 pmrowla staff   5 Dec 15 10:54 .gitignore
lrwxr-xr-x  1 pmrowla staff   6 Dec 15 11:02 foo -> ../foo

test-stage-add git:master ✓ ✩  py:dvc ❯ dvc add -v foo
2021-12-15 11:09:21,483 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/config.local' to gitignore file.
2021-12-15 11:09:21,490 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/tmp' to gitignore file.
2021-12-15 11:09:21,490 DEBUG: Adding '/Users/pmrowla/git/scratch/test-stage-add/.dvc/cache' to gitignore file.
2021-12-15 11:09:21,505 ERROR: Output(s) outside of DVC project: foo. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pmrowla/git/dvc/dvc/command/add.py", line 21, in run
    self.repo.add(
  File "/Users/pmrowla/git/dvc/dvc/utils/collections.py", line 163, in inner
    result = func(*ba.args, **ba.kwargs)
  File "/Users/pmrowla/git/dvc/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/pmrowla/git/dvc/dvc/repo/scm_context.py", line 152, in run
    return method(repo, *args, **kw)
  File "/Users/pmrowla/git/dvc/dvc/repo/add.py", line 171, in add
    stages = list(ui.progress(stages_it, desc=desc, unit="file"))
  File "/Users/pmrowla/.virtualenvs/dvc/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/Users/pmrowla/git/dvc/dvc/repo/add.py", line 259, in create_stages
    stage = repo.stage.create(
  File "/Users/pmrowla/git/dvc/dvc/repo/stage.py", line 181, in create
    stage = create_stage(
  File "/Users/pmrowla/git/dvc/dvc/stage/__init__.py", line 88, in create_stage
    check_no_externals(stage)
  File "/Users/pmrowla/git/dvc/dvc/stage/utils.py", line 136, in check_no_externals
    raise StageExternalOutputsError(
dvc.stage.exceptions.StageExternalOutputsError: Output(s) outside of DVC project: foo. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
------------------------------------------------------------

This breaks the documented use case where you can add a symlinked file to make DVC copy that file (from the outside-of-repo symlinked location) into the cache.

This use case has been superseded by add --to-cache, but if we intended to drop/deprecate the "add symlinked file" behavior we need to document it properly (and account for the specific case where the symlink points to the DVC cache directory)

pmrowla avatar Dec 15 '21 02:12 pmrowla

I think this issue probably needs more investigation - adding symlinks this way definitely worked properly in past versions of DVC, and we should figure out what/when actually broke this behavior. Depending on that, the fix in https://github.com/iterative/dvc/pull/7119 may or may not be sufficient?

Also, it's clear we should really have (add and stage add) test cases for this.


should also note that I did a quick check to see if this was related to the path_info -> string paths change, but it looks like this was already broken 4175f9ffa4948157436283bef11153a7233e90ca (before we started the drop path_info changes) @efiop

pmrowla avatar Dec 15 '21 02:12 pmrowla

That's fair. Currently unfortunately I can't really dedicate too much time to this, but the fix in #7119 solved my case completely.

endremborza avatar Dec 15 '21 14:12 endremborza

I started looking into it a little better as I made time for it.

There are 4 symlink related tests in func/test_add.py (which breaks a bunch of pylint rules btw and can not go through pre-commit) I added one in #7119 that basically covers the use case above. This breaks without the suggested change, and passes with it.

However, I'm not sure about the policy for handling this. Few possibilities I see:

  • As suggested above, the whole thing can be dropped/deprecated. this requires a workaround for my original issue of a symlinked cache and a bit of documentation.
  • A warning can be sent to the user explaining if copying of external symlinks is happening (which is the default behavior)
  • Behavior can be dependent on cache settings: e.g. raise error on type: copy and ignore on type: symlink

I personally think that going the current way of the PR is fine, as those who deliberately use symlinks in a repo probably know what they are doing, but I don't have much insight on the user base. If a firm decision can be made quickly, I am open to implementing it, as I am quite eager for this to work in my cases without relying on my fork.

endremborza avatar Jan 26 '22 23:01 endremborza

Looks like this was introduced in https://github.com/iterative/dvc/pull/5954.

dberenbaum avatar Jan 28 '22 21:01 dberenbaum

Likely will be resolved as a part of https://github.com/iterative/dvc/issues/7384

efiop avatar Feb 15 '22 13:02 efiop

#7581 closed #7384, but as noted in https://github.com/iterative/dvc/issues/7601#issuecomment-1103531163 😅 , this is still an issue. Any plan on how/when this is expected to be fixed?

dberenbaum avatar Apr 27 '22 15:04 dberenbaum

Closing in favor of #7668

efiop avatar May 03 '22 12:05 efiop

Not addressed by #7668. See https://github.com/iterative/dvc/issues/7668#issuecomment-1173062217.

dberenbaum avatar Jul 05 '22 14:07 dberenbaum

I just needed to upgrade my dvc, but this is still broken, so I need to make the change shown in #7119 by hand every time I want to use a global symlink cache. can it really not be just added? as it seems to break nothing

endremborza avatar Jan 18 '23 16:01 endremborza

This was fixed by https://github.com/iterative/dvc/pull/9626

efiop avatar Jun 27 '23 14:06 efiop

thank you!

endremborza avatar Jun 27 '23 15:06 endremborza

@endremborza Thank you for the feedback, let us know if you run into this problem again. And sorry for such a long turnaround time 🙁

efiop avatar Jun 27 '23 17:06 efiop