
Pull imports from source

Open johnyaku opened this issue 3 years ago • 7 comments

Files that have been added via dvc import-url can be individually downloaded from their source location via dvc update <target>, but dvc pull only looks for the files on the remote.

This request is for dvc pull to include an option to download these files from their source.

Whether this should be the default behaviour is open to discussion, but it would be helpful to at least have this option; let's call it --from-source for now.

Use case: In scientific research we often use previously published datasets as references (for comparison against new data, for example). These datasets are hosted on well-funded, stable file servers with stable URLs for each file. It would be helpful to be able to include such data in a DVC repo without having to include it in the DVC remote for that repo.

For less stable URLs (meaning URLs where the target data is subject to change), I can see the value in including the data in the remote, as this allows version control and fetching previous versions.

One way to deal with these conflicting use cases is to include an option for marking imported URLs as "fixed" (non-variable). This property could be recorded in the .dvc file so that DVC knows a) not to push the data to the remote and b) to pull the data from source as part of a dvc pull. Vanilla dvc import-url operations would continue to function as they do now.
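The "fixed" marker proposed above could drive both decisions from a single flag. A minimal Python sketch of that rule follows. The `fixed` attribute, the class, the function names and the file URLs are all hypothetical; none of them is part of DVC or of the current .dvc format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportedUrl:
    """One import-url entry. `fixed` is the hypothetical marker proposed above."""
    url: str
    fixed: bool = False  # vanilla imports keep today's behaviour


def should_push(entry: ImportedUrl) -> bool:
    # (a) data from a fixed source is not copied to the DVC remote
    return not entry.fixed


def pull_origin(entry: ImportedUrl) -> str:
    # (b) data from a fixed source is pulled from its original URL,
    # everything else is pulled from the remote as it is now
    return "source" if entry.fixed else "remote"


# Illustrative entries only: one marked as fixed, one vanilla import.
reference = ImportedUrl("https://source.com/data.csv", fixed=True)
vanilla = ImportedUrl("https://source.com/other.csv")
```

With this rule, `reference` would be skipped on push and fetched from its URL on pull, while `vanilla` would go through the remote in both directions.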

johnyaku avatar Sep 09 '22 07:09 johnyaku

Related: https://github.com/iterative/dvc/issues/8172

dberenbaum avatar Sep 09 '22 16:09 dberenbaum

Perhaps the cleanest and simplest way to incorporate this kind of functionality is something like the following:

dvc update --all (download all files that have been dvc import-urled from source into the workspace)

Then anyone wanting to replicate a dataset can grab all the necessary data with just

git clone <repo-url>
cd <repo>
dvc pull
dvc update --all

Alternatively, dvc pull could get an additional option as follows:

dvc pull --from-source (fall back to downloading from source iff data not available on remote, or if no remote set)

These alternatives are not exclusive.

But either way, suppose we have project A that uses dvc import-url to import data.csv from https://source.com. Then suppose that project B uses dvc import to import data.csv from project A into project B. Then dvc pull --from-source or dvc update --all (or both) should download data.csv from https://source.com into the workspace for project B.
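The project A / project B scenario can be modelled as following import links until an original URL is reached, with the source used only when the remote does not hold the data. This is a rough sketch under simplifying assumptions: the `PROJECTS` table, both function names and the full file URL are made up for illustration and do not reflect how DVC stores or resolves imports.

```python
from typing import Optional

# Hypothetical, simplified model of where each project's file came from.
# ("url", address) stands in for dvc import-url; ("repo", project) for dvc import.
PROJECTS = {
    "A": {"data.csv": ("url", "https://source.com/data.csv")},
    "B": {"data.csv": ("repo", "A")},
}


def resolve_source(project: str, path: str) -> Optional[str]:
    """Follow import links until an original URL is found, else return None."""
    seen = set()
    while project not in seen:
        seen.add(project)
        entry = PROJECTS.get(project, {}).get(path)
        if entry is None:
            return None
        kind, target = entry
        if kind == "url":
            return target
        project = target  # imported from another project: keep following
    return None  # projects import from each other in a cycle


def pull_location(project: str, path: str, in_remote: bool) -> str:
    """Prefer the remote; fall back to the original source only if needed."""
    if in_remote:
        return "remote"
    source = resolve_source(project, path)
    if source is None:
        raise FileNotFoundError(f"{path} is neither in the remote nor traceable to a source")
    return source
```

Here `pull_location("B", "data.csv", in_remote=False)` resolves through project A to the original URL, which is the behaviour requested for both dvc pull --from-source and dvc update --all.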

In #8172 this is described as an edge case. But I suggest that this could be a fairly central use case in scientific publishing, where invariant data is lodged in public repositories with stable URLs. Publishing raw data in these public repositories is often a condition of grant funding. So although we are likely to use DVC remotes while we initially compile and analyse our data, we will inevitably upload most if not all of the raw data to these public repositories. But if we continue to organise our datasets with DVC (just with data hosted in public repositories rather than DVC remotes) then future projects can build on already published datasets with a simple dvc import, regardless of where the data is actually hosted.

johnyaku avatar Sep 11 '22 23:09 johnyaku

Makes sense @johnyaku! For both import and import-url, an option to determine whether to push a copy of the data may be needed, like proposed in https://github.com/iterative/dvc/issues/4527.

Regardless, I see no reason DVC should not try to fall back to the original source if the data is not in the remote.

dberenbaum avatar Sep 12 '22 12:09 dberenbaum

Took a shot at update --all, here's a draft: https://github.com/iterative/dvc/pull/8288

dtrifiro avatar Sep 14 '22 13:09 dtrifiro

This is already supported using --recursive.

See https://github.com/iterative/dvc/issues/3511. Originally, it was requested to support --all, but we went with --recursive/-R. There is, however, a bug that should be fixed: it should skip non-import stages.
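The intended filtering can be sketched as follows. This is only an illustration: it assumes that an import is a stage with no command, exactly one dependency and exactly one output, and the stage dictionaries and function names are made up rather than taken from DVC.

```python
def is_import(stage: dict) -> bool:
    """Assumed rule: no command, exactly one dependency, exactly one output."""
    return (
        not stage.get("cmd")
        and len(stage.get("deps", [])) == 1
        and len(stage.get("outs", [])) == 1
    )


def stages_to_update(stages: dict) -> list:
    """Names of the stages a recursive update should touch, skipping the rest."""
    return sorted(name for name, stage in stages.items() if is_import(stage))


# Illustrative stages: an import, a plainly added file, and a pipeline stage.
STAGES = {
    "data.csv.dvc": {
        "deps": [{"path": "https://source.com/data.csv"}],
        "outs": [{"path": "data.csv"}],
    },
    "notes.txt.dvc": {"outs": [{"path": "notes.txt"}]},
    "train": {
        "cmd": "python train.py",
        "deps": [{"path": "data.csv"}],
        "outs": [{"path": "model.pkl"}],
    },
}
```

Only `data.csv.dvc` qualifies here; the added file and the pipeline stage are skipped rather than causing an error.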

skshetry avatar Sep 15 '22 15:09 skshetry

update --recursive (or update --all) feels more like a workaround than an actual solution to this problem. There's a difference between pulling an import from source and updating it from source.

For fetch/pull I would expect DVC to verify that the source URL has not changed and then download it (like we do with import-url --no-download).

Update actually modifies the import to use the latest file from that source location, which is good enough for the case where you know the source is "stable", but does not solve this problem in the general case.
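The distinction can be made concrete with a small sketch. It is not DVC's implementation: the `Import` class, both functions and the dictionary standing in for the source server are hypothetical, and a SHA-256 digest is used as a stand-in for whatever version information the import records. A real tool would likely compare cheap metadata before transferring any data.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    # Stand-in for the version information recorded with an import.
    return hashlib.sha256(data).hexdigest()


@dataclass
class Import:
    url: str
    recorded: str  # checksum recorded when the import was created


def pull_from_source(imp: Import, source: dict) -> bytes:
    """Pull: return the data only if the source still matches the record."""
    data = source[imp.url]
    if checksum(data) != imp.recorded:
        raise ValueError(f"{imp.url} has changed since it was imported")
    return data


def update_from_source(imp: Import, source: dict) -> bytes:
    """Update: accept whatever the source holds now and rewrite the record."""
    data = source[imp.url]
    imp.recorded = checksum(data)
    return data
```

If the source changes after the import, `pull_from_source` fails and leaves the record alone, whereas `update_from_source` succeeds and moves the record to the new content. That is why update only substitutes for pull when the source is known to be stable.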

pmrowla avatar Sep 16 '22 00:09 pmrowla

Thanks @dtrifiro for your efforts and @skshetry for the workaround with update --recursive. This gets me over the immediate hurdle that I'm facing, but @pmrowla is correct that it does not solve the general case.

In particular, using the example projects A and B above, with data.csv originally import-urled into project A, at the moment running dvc import <url-for-project-A> data.csv in Project B fails because data.csv is not in the remote for Project A. The error message is similar to that when dvc pull fails in Project A, and so presumably the mechanism is similar.

Despite my earlier insistence that there really is invariant data with stable URLs, the semantics of import and pull have important differences, and despite being grateful for the workaround with update --recursive, I think it would be better to have the ability to pull --from-source (and also checkout --from-source). Perhaps this could become the default behaviour, but it would probably be better to first provide the functionality and road-test it in the real world.

Perhaps the simplest way to achieve this functionality is with an additional --from-source option for pull/checkout/import. Alternatively (or additionally) perhaps there could be an extra field in .dvc files to mark files as originating from a source URL rather than a source DVC project (and associated remote). Such files could then be pulled from source without the need to specify the --from-source option.

Tangentially, it might similarly be helpful to mark certain data files as "unprotected". We use symlinks to an external cache but often need to be able to rewrite files after first running dvc unprotect <target>. Generally we know which files need to be unprotected, and do this prior to running the workflow. But it would be convenient to be able to mark these files as "unprotected" so that an unprotect operation automatically follows pull/checkout/update etc. This is tangential to the central feature request here, but mentioned in case there is an appetite for revising the fields in .dvc files.

johnyaku avatar Sep 20 '22 02:09 johnyaku