
pull: Pulling data produced with older DVC versions not possible

Open aschuh-hf opened this issue 2 years ago • 7 comments

Bug Report

Description

In our mono-repo we have some finished projects which used DVC 2 (and maybe even DVC 1 still). More recent projects use DVC 3 (dvc init --subdir). These newer projects may still require some outputs, in particular trained model weights, that were tracked with an older DVC version. However, the old dvc.yaml pipeline files used features that have been dropped in DVC 3 (stage-level vars).

When trying to pull any file in the old DVC project folder, dvc pull some_file.dvc also tries to validate all other DVC pipeline files and fails with an error. This prevents us from checking out the data without making changes to historic DVC pipelines.

We have the same issue with dvc.lock files created before schema: '2.0' was introduced. These seem to be no longer supported by DVC 3.

required key not provided, in schema
'dvc.lock' validation failed: 13 errors.

extra keys not allowed, in create_index_table, line 2, column 3
    1 create_index_table:
    2   cmd: python -m scripts.create_index_table

[...]
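To get an overview of which lockfiles in a mono-repo predate schema: '2.0', something like the following rough sketch may help. It only checks for a top-level schema: line as plain text and does not use any DVC API; the function name find_legacy_lockfiles and the directory names in the demo are made up for illustration.

```python
import tempfile
from pathlib import Path


def find_legacy_lockfiles(root):
    """Return dvc.lock paths under `root` that have no top-level `schema:` key.

    Hypothetical helper: a plain-text check, not part of DVC itself.
    """
    legacy = []
    for lock in sorted(Path(root).rglob("dvc.lock")):
        lines = lock.read_text(encoding="utf-8").splitlines()
        # A top-level key starts in column 0, so indented matches are ignored.
        if not any(line.startswith("schema:") for line in lines):
            legacy.append(lock)
    return legacy


# Demo on a throwaway directory with one old-style and one new-style lockfile.
with tempfile.TemporaryDirectory() as tmp:
    old_dir = Path(tmp, "old_project")
    new_dir = Path(tmp, "new_project")
    old_dir.mkdir()
    new_dir.mkdir()
    (old_dir / "dvc.lock").write_text(
        "create_index_table:\n  cmd: python -m scripts.create_index_table\n",
        encoding="utf-8",
    )
    (new_dir / "dvc.lock").write_text(
        "schema: '2.0'\nstages:\n  train:\n    cmd: python train.py\n",
        encoding="utf-8",
    )
    for path in find_legacy_lockfiles(tmp):
        print(path.relative_to(tmp))
```

In a real checkout, the root argument would be the mono-repo directory instead of the temporary one.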

Reproduce

Expected

As a data versioning tool and a tool for enabling reproducibility, DVC should have strong backward compatibility for any generated data (and ideally also pipelines, though one could use an older DVC version to reproduce them). At the very least, it should still be possible to check out data produced with older DVC versions using the latest DVC version.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.16.0 (conda)
---------------------------
Platform: Python 3.10.6 on Linux-3.10.0-1127.8.2.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 2.15.4
        dvc_objects = 1.0.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.2.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.76)
Config:
        Global: /home/aschuh/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3, s3
Workspace directory: xfs on /dev/sda1
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/f00b3983400f9f38ae255d78e2110269

Additional Information (if any):

aschuh-hf avatar Aug 25 '23 19:08 aschuh-hf

The DVC 3 migration guide states:

DVC 3.0 remains compatible with pre-existing data tracked by older DVC releases

aschuh-hf avatar Aug 25 '23 20:08 aschuh-hf

After manually editing dvc.lock files to add top-level keys

schema: '2.0'
stages:

the command dvc pull target.dvc -v prints the following (note that the tracked file I want to pull has nothing to do with the pipeline / dvc.lock file in another directory).

$ dvc pull target.dvc -v
2023-08-25 20:27:43,600 DEBUG: v3.16.0 (conda), CPython 3.10.6 on Linux-3.10.0-1127.8.2.el7.x86_64-x86_64-with-glibc2.17
2023-08-25 20:27:43,600 DEBUG: command: /opt/conda/envs/venv/bin/dvc pull target.dvc -v
2023-08-25 20:27:46,321 DEBUG: Lockfile '../../../experiments/experiment/dvc.lock' needs to be updated.

The command seems stuck here and hasn't continued for several minutes.

^C2023-08-25 20:33:45,726 ERROR: interrupted by the user
Traceback (most recent call last):
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 62, in wrapper
    return f(repo, *args, **kwargs)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/pull.py", line 31, in pull
    processed_files_count = self.fetch(
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 62, in wrapper
    return f(repo, *args, **kwargs)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc/repo/fetch.py", line 162, in fetch
    fetch_transferred, fetch_failed = ifetch(
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc_data/index/fetch.py", line 54, in fetch
    data.fs.exists(data.path)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 355, in exists
    return self.fs.exists(path)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/conda/envs/venv/lib/python3.10/site-packages/fsspec/asyn.py", line 94, in sync
    if event.wait(1):
  File "/opt/conda/envs/venv/lib/python3.10/threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "/opt/conda/envs/venv/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

EDIT: I then ran dvc status. This also took very long, but finished eventually.
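The manual edit described above (adding the top-level schema: '2.0' and stages: keys) could be scripted roughly as follows. This is a minimal sketch that treats the lockfile as plain text and assumes the old file consists only of stage entries at the top level; the function name wrap_legacy_lockfile is made up, and whether DVC accepts the result for a given project is not guaranteed.

```python
def wrap_legacy_lockfile(text: str) -> str:
    """Nest an old-style dvc.lock body under `schema: '2.0'` and `stages:`.

    Hypothetical helper, not part of DVC. Works on text only, so YAML
    comments and key order are preserved.
    """
    lines = text.splitlines()
    # Leave files alone that already declare a top-level schema.
    if any(line.startswith("schema:") for line in lines):
        return text
    # Indent every non-empty line by two spaces so stages sit under `stages:`.
    body = [("  " + line) if line.strip() else line for line in lines]
    return "\n".join(["schema: '2.0'", "stages:", *body]) + "\n"


legacy = "create_index_table:\n  cmd: python -m scripts.create_index_table\n"
print(wrap_legacy_lockfile(legacy))
```

Running the function on an already wrapped file returns it unchanged, so it should be safe to apply repeatedly.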

aschuh-hf avatar Aug 25 '23 20:08 aschuh-hf

stage-level vars

@aschuh-hf This change was reverted in https://github.com/iterative/dvc/pull/9647, so it should not cause any failures.

DVC 3.0 remains compatible with pre-existing data tracked by older DVC releases

I think we should update this to say that it remains compatible with 2.x. @iterative/dvc Thoughts on this?

@aschuh-hf I agree with your overall point, and that is why we tried to remain compatible with at least 2.x despite a major release. I hope that's a reasonable compromise given our team size, and we aim to minimize these breaking changes.

dberenbaum avatar Aug 28 '23 14:08 dberenbaum

@dberenbaum I definitely see the need to every now and then break backwards compatibility. Given the maturity of the tool (major version 3) and user base, it's certainly best to minimize those as you already mention.

Apart from needing an older major version to actually reproduce pipelines by re-executing stages, I think there are two other aspects which may require separate attention with regard to breaking changes.

One you also highlighted here, the impact it has on performing DVC operations which aggregate information across timepoints of the Git history (e.g., dvc metrics diff). This is a rather complex issue for the evolution of DVC. It's good you guys are aware and can thus weigh potential issues of breaking changes against their benefits.

The other issue is the reason I opened this GitHub issue. That even if a dvc.yaml schema is no longer supported by a more recent DVC version, I should be able to checkout the data versioned (and ideally also produced) with older versions of DVC. Thus, commands such as dvc pull target.dvc (and ideally also dvc pull target based on dvc.lock) should still work even if the schema has changed. This is because in this particular case, we are only using DVC as a data versioning tool (i.e., an alternative to Git LFS). Data added with dvc add using an older DVC version, I would still like to be able to pull and checkout with dvc pull.

In fact, the .dvc file we wanted to pull here was still valid for the latest DVC version. Yet, dvc pull target.dvc failed because it also tried to load and validate unrelated DVC pipeline files within the same DVC project. At the moment, I don't see why that would be necessary.

aschuh-hf avatar Aug 29 '23 22:08 aschuh-hf

Seemingly related issue pointed out by @skshetry: https://github.com/iterative/dvc/issues/8768#issuecomment-1375403471

aschuh-hf avatar Aug 29 '23 22:08 aschuh-hf

The other issue is the reason I opened this GitHub issue. That even if a dvc.yaml schema is no longer supported by a more recent DVC version, I should be able to checkout the data versioned (and ideally also produced) with older versions of DVC. Thus, commands such as dvc pull target.dvc (and ideally also dvc pull target based on dvc.lock) should still work even if the schema has changed. This is because in this particular case, we are only using DVC as a data versioning tool (i.e., an alternative to Git LFS). Data added with dvc add using an older DVC version, I would still like to be able to pull and checkout with dvc pull.

I think it's related to a number of issues like https://github.com/iterative/dvc/issues/6150 and https://github.com/iterative/dvc/issues/7585. Is your primary concern about backwards compatibility or about dvc being too aggressive in requiring that the entire project be in a "valid" state for any dvc operations to work?

dberenbaum avatar Oct 13 '23 16:10 dberenbaum

Is your primary concern about backwards compatibility or about dvc being too aggressive in requiring that the entire project be in a "valid" state for any dvc operations to work?

I opened the issue having the latter in mind. The data versioning functionality of DVC should always remain functional, and not break because some DVC Pipeline definitions require a specific version of the DVC tools. Especially in mono-repos where different projects may require different versions of the DVC tools, one should be able to just pull a data file versioned with DVC and stored in the linked cloud storage using any DVC version.

For the case here, this relates to data files previously added with dvc add (.dvc files), but it would be great if it also applied to DVC Pipeline outputs defined by dvc.lock.

aschuh-hf avatar Oct 16 '23 20:10 aschuh-hf

Closing, as it's been a while since we dropped support for 1.x lockfiles. It's unlikely we are going to add support for them back, but we will be careful when making incompatible changes, and we don't have any plans to make any for now.

skshetry avatar Mar 25 '24 10:03 skshetry