worktree add: `dvc add` removes valid metadata for files which are already tracked
`dvc add` always generates new stages, overwriting any existing stages and dumping a new `.dvc` file. For regular DVC workflows this is not a problem, but for cloud versioning/worktree remotes it causes a loss of metadata.
Consider this workflow, where I add one file and then push it:
```
$ dvc add foo
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: foo
$ dvc push
1 file pushed
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: foo
  version_id: '2022-09-15T07:45:55.1712903Z'
  etag: '0x8DA96EE52317788'
```
If I were to run `dvc pull` now, DVC would pull that specific version of the file (`2022-09-15T07:45:55.1712903Z`) from the remote.
If I now run `dvc add` without modifying `foo`, I get a `.dvc` file that does not contain any version/etag metadata, even though `foo` has not changed and the metadata for the version I already pushed should still be valid:
```
$ dvc add foo
To track the changes with git, run:
    git add foo.dvc
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  path: foo
```
Now if I run `dvc pull`, DVC will just pull the latest version of `foo` from the remote, which may not necessarily be the version that I initially pushed (since the `.dvc` file no longer contains a known `version_id` to pull).
Note that if I had modified `foo`, this metadata-clearing behavior would be expected, since the newly modified version has not been pushed (and we do not know what the `version_id`/etag for that new version will be until we push it).
The same problem applies to directories: individual file entries that have not changed still lose their metadata on `dvc add`.
For the worktree workflow, `dvc add` cannot just remove existing stages and re-write them. It actually needs to check for outputs that are already tracked, and preserve the old metadata for file entries that have not changed.
For individual files, if the output hash (oid) has not changed, we need to preserve the existing metadata for the entire output.
For directories, we need to do an actual tree merge, checking the oid for each entry in the new tree: if we have meta for that oid in the pre-existing tree, it needs to be merged into the new one.
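A minimal sketch of the merge described above, in plain Python. This is illustrative only and does not reflect DVC's actual internals; the function names, dict shapes, and the `relpath` key are assumptions. The idea is simply: carry cloud-version metadata (`version_id`/`etag`) forward whenever the content hash (oid) is unchanged, and drop it otherwise.

```python
def merge_file_meta(old_out: dict, new_out: dict) -> dict:
    """Single-file output: if the oid (md5) matches, carry over old cloud metadata."""
    if old_out.get("md5") == new_out.get("md5"):
        merged = dict(new_out)
        for key in ("version_id", "etag"):
            if key in old_out:
                merged[key] = old_out[key]
        return merged
    # Content changed: the old version_id/etag are no longer valid for this file.
    return new_out


def merge_tree_meta(old_entries: list, new_entries: list) -> list:
    """Directory output: merge per-entry, matching entries by relative path."""
    old_by_path = {e["relpath"]: e for e in old_entries}
    merged = []
    for entry in new_entries:
        old = old_by_path.get(entry["relpath"])
        if old is not None:
            entry = merge_file_meta(old, entry)
        merged.append(entry)
    return merged


# The single-file case from the transcript above:
old = {"md5": "d3b07384d113edec49eaa6238ad5ff00", "size": 4, "path": "foo",
       "version_id": "2022-09-15T07:45:55.1712903Z", "etag": "0x8DA96EE52317788"}
new = {"md5": "d3b07384d113edec49eaa6238ad5ff00", "size": 4, "path": "foo"}
print(merge_file_meta(old, new))  # unchanged file keeps its version_id and etag
```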
> Now if I run `dvc pull`, DVC will just pull the latest version of `foo` from the remote, which may not necessarily be the version that I initially pushed (since the `.dvc` file no longer contains a known `version_id` to pull)
In this scenario, where there is a cloud-versioned remote and no version ID in the `.dvc` file, should `dvc pull` actually pull the latest version by default, or should it fail? I would expect pulling the latest version to be a separate flag or command. Failing seems more consistent with the non-cloud-versioned workflow (`dvc add` followed by `dvc pull` will fail, since the added data hasn't been pushed).
If `version_aware` is not enabled, I'm not sure how much sense this whole workflow makes, but I would expect `dvc add && dvc pull` to fail because the cloud data doesn't match the version info in the `.dvc` file. This is similar to doing `dvc import-url --no-download` followed by `dvc pull`: if the cloud object has changed, `dvc pull` will fail.
Maybe worth discussing with @dmpetrov .
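To make the alternatives above concrete, here is a hypothetical sketch of the two pull policies being debated for a cloud-versioned remote when the `.dvc` out has no `version_id`. The names (`PullPolicy`, `resolve_fetch_target`) are illustrative, not DVC API.

```python
from enum import Enum


class PullPolicy(Enum):
    FAIL = "fail"      # consistent with non-cloud-versioned dvc add + dvc pull
    LATEST = "latest"  # current behavior: silently fetch whatever is newest


def resolve_fetch_target(out: dict, policy: PullPolicy) -> str:
    """Decide what to fetch for one output, given the chosen policy."""
    version_id = out.get("version_id")
    if version_id is not None:
        # A known version was pushed: always pull exactly that version.
        return f"pull exact version {version_id}"
    if policy is PullPolicy.FAIL:
        raise RuntimeError("no version_id in .dvc file; push the data first")
    return "pull latest version"
```

Under the `FAIL` policy, `dvc add && dvc pull` errors out just as it would without cloud versioning; `LATEST` would be opt-in via a flag or a separate command.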
I'm not sure what the behavior should be for fetch/pull with no etag or version ID, but that's a separate question from this issue. `dvc add` should not be modifying metadata for unchanged files, whether or not we allow fetching with no version ID.
Sorry, I missed that this was without modifying `foo`.
Related to #8354 and #8409. DVC shouldn't be modifying `version_id` for unchanged files.