dvc exp run --dry: command fails when dependency files are missing

Description

I'm using DVC with Hydra and I would like to check whether my experiment run will reuse cached outputs of some stages. The dvc exp run --dry command fails when some of the dependencies are missing. dvc status could do the job, but it cannot emulate dvc exp run -n exp3 -S parameters.parameter_name=123. I can use Hydra to compose params.yaml, but that requires additional manual work. Maybe it would be worth allowing dvc status to accept -S overrides for Hydra and to compile params.yaml inside the dvc status command?

Reproduce

git clone https://github.com/Danila89/dvc_empty.git && cd dvc_empty && git pull --all && git checkout dvc_status_issue && dvc exp run -n something --dry

Expected

Dry-run results indicating that dependencies are missing for all the stages except stage1.

Environment information

Output of dvc doctor:

(base) danila.savenkov@RS-UNIT-0099 dvc_empty % dvc doctor
DVC version: 3.30.3 (pip)
-------------------------
Platform: Python 3.10.9 on macOS-13.3.1-arm64-arm-64bit
Subprojects:
        dvc_data = 2.22.3
        dvc_objects = 1.3.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.4.1
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.5.0, boto3 = 1.26.76)
Config:
        Global: /Users/danila.savenkov/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/64bbbded2e55036b006c56ceaefa98e1

Dec 04 '23 14:12 Danila89

I'm using DVC with Hydra and I would like to check whether my experiment run will reuse cached outputs of some stages. The dvc exp run --dry command fails when some of the dependencies are missing. dvc status could do the job, but it cannot emulate dvc exp run -n exp3 -S parameters.parameter_name=123.

How does changing the parameters with -S affect whether the experiment will reuse cached data?

Maybe it would be worth allowing dvc status to accept -S overrides for Hydra and to compile params.yaml inside the dvc status command?

I'm not sure we can fit it into dvc status, but maybe we can have a way to compile params.yaml in dvc exp run without actually running the experiment. Would that help?

Dec 06 '23 19:12 dberenbaum

How does changing the parameters with -S affect whether the experiment will reuse cached data?

As you can see here, stage1 depends on parameters.parameter_name. Passing parameters.parameter_name that is different from the cached run's parameters.parameter_name will invalidate the cache.

I'm not sure we can fit it into dvc status, but maybe we can have a way to compile params.yaml in dvc exp run without actually running the experiment. Would that help?

I believe that compiling params.yaml alone will not help. The dry run in this case fails both with and without params.yaml. In my opinion, a dry run should consider a stage's outputs changed when any of (cmd, deps, params) has changed, and unchanged otherwise. And it certainly should not fail because some outputs are missing.
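The rule proposed here can be illustrated with a minimal sketch. This is not DVC's implementation; the function name changed_fields, the stage dictionaries, and the command string are hypothetical and only model the idea that missing outputs play no part in the decision.

```python
def changed_fields(recorded, current):
    """Return which of cmd/deps/params differ between two stage descriptions.

    Outputs are deliberately ignored: a missing output file never makes
    this check fail, it simply is not consulted.
    """
    return [
        field
        for field in ("cmd", "deps", "params")
        if recorded.get(field) != current.get(field)
    ]


# Hypothetical record of stage1 from an earlier run.
recorded = {
    "cmd": "python stage1.py",
    "deps": {},
    "params": {"parameters.parameter_name": 1},
}
# Same stage, but with the parameter overridden as in the -S example.
current = dict(recorded, params={"parameters.parameter_name": 123})

print(changed_fields(recorded, current))   # ['params']
print(changed_fields(recorded, recorded))  # []
```

Under this rule a stage is reported as changed exactly when the returned list is non-empty.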

Dec 06 '23 21:12 Danila89

Sorry folks, I haven't been following the whole discussion. So, to confirm, even before we discuss Hydra and params.yaml, there is another issue with dvc exp run --dry in a simpler scenario. There is no way to tell it to consider stages with missing dependencies as unchanged, is that correct?

Then my question is (probably you have discussed this already) - why is --dry --allow-missing not possible?

Or is it about making it a default behavior in --dry mode?

Dec 07 '23 04:12 shcheklein

Then my question is (probably you have discussed this already) - why is --dry --allow-missing not possible?

@shcheklein --dry --allow-missing works but not in the above example because there is no dvc.lock file in the repo. A non-dry run will also ignore --allow-missing here for the same reason.

@Danila89 This is why I mentioned in email that --allow-missing currently is not doing anything for you if you keep no version of dvc.lock. Similarly, it is expected (although probably not clear from the output) that dry run fails at the 2nd stage because it is missing 1.txt, which is a dependency of that stage. Without dvc.lock or actually running the stage, dry run does not know what version of 1.txt to expect.

Instead, I was suggesting a workaround where you could:

1. Compile params.yaml without running the experiment (doesn't exist but theoretically would be easy to add)
2. Run dvc status using that compiled params.yaml to get the info you want
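Until something like that exists in DVC, the first step could be approximated by hand. The sketch below is a simplified stand-in, not Hydra's composition logic: it only handles plain key.path=value overrides on an in-memory dictionary, and the names parse_value and apply_overrides are hypothetical.

```python
import copy
import json


def parse_value(raw):
    """Read the value as JSON (numbers, booleans, null); otherwise keep it as a string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(params, overrides):
    """Return a copy of params with each 'a.b.c=value' override applied."""
    result = copy.deepcopy(params)
    for override in overrides:
        dotted_key, _, raw = override.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


# Hypothetical contents of params.yaml, already loaded into a dict.
base = {"parameters": {"parameter_name": 1}}
compiled = apply_overrides(base, ["parameters.parameter_name=123"])
print(compiled)  # {'parameters': {'parameter_name': 123}}
```

The resulting dictionary would then be written back to params.yaml (for example with PyYAML's yaml.safe_dump) before running dvc status.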

Overall, AFAIU you want to use the run cache to check if any cached version of each stage exists and to restore that stage. We have discussed it before but have not done that with --dry-run or --allow-missing because it could be surprising behavior. It will end up creating or modifying entries in dvc.lock for matches that it finds, which might be unexpected for those flags. I'm happy to discuss it, but that's why it doesn't work like you expect today.

Dec 07 '23 17:12 dberenbaum

Instead, I was suggesting a workaround where you could:

1. Compile params.yaml without running the experiment (doesn't exist but theoretically would be easy to add)
2. Run dvc status using that compiled params.yaml to get the info you want

This will work in the current scenario, although it is not very convenient that I have to write a Python script for step 1. But the next thing that will be pretty important for me is the ability to check whether the remote cache can be reused by the dvc exp run --pull command. Currently dvc status cannot do this; I'm wondering whether dvc exp run --pull --dry can. Perhaps as a workaround I could pull the whole remote cache locally (if that is possible with dvc) and then use dvc status as you mentioned, but I would prefer a solution without massive cache pulling.

Without dvc.lock or actually running the stage, dry run does not know what version of 1.txt to expect.

Here is an example of the dry run failing with both dvc.lock and params.yaml present:

git clone https://github.com/Danila89/dvc_empty.git && cd dvc_empty && git pull --all && git checkout dvc_status_issue_1 && dvc exp run -n something --dry

Is this still expected?

Dec 07 '23 19:12 Danila89

This will work in the current scenario, although it is not very convenient that I have to write a Python script for step 1.

Sorry, I meant if we implemented a way to do this from DVC. However, I agree it's a bit of a hack, so not sure if this is the best solution, although maybe it's a useful debugging tool.

But the next thing that will be pretty important for me is the ability to check whether the remote cache can be reused by the dvc exp run --pull command. Currently dvc status cannot do this; I'm wondering whether dvc exp run --pull --dry can.

Right, I don't think this will work as you would want today. What we need is better support for the run cache. Theoretically, it could be possible to:

pull only the run-cache, which is a lightweight record of the deps and outs of every run
for any dvc exp run -S command, try to get as far as possible using the run cache without actually running the stages
check if all of that data exists on the remote

Maybe we could support something like --only-run-cache or --dry-run-cache that isn't quite the same as a dry run but only makes updates that are found in the run cache (so never executes any stages).
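The three steps above can be sketched roughly as follows. This only models the idea and is not how DVC's run cache is implemented: run_key, resolve_from_run_cache, the stage dictionaries, the command strings, and the hash value "aaa" are all hypothetical, and every dependency is assumed to be produced by an earlier stage.

```python
import hashlib
import json


def run_key(cmd, params, dep_hashes):
    """Build a lookup key from everything that determines a stage's outputs."""
    payload = json.dumps(
        {"cmd": cmd, "params": params, "deps": dep_hashes}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def resolve_from_run_cache(stages, run_cache, remote_hashes):
    """Walk stages in order without executing anything and report each one's status."""
    known = {}   # file name -> content hash resolved so far
    report = {}
    for stage in stages:  # assumed to be in topological order
        if any(dep not in known for dep in stage["deps"]):
            # An upstream stage could not be resolved, so this one cannot be either.
            report[stage["name"]] = "must run"
            continue
        dep_hashes = {dep: known[dep] for dep in stage["deps"]}
        outs = run_cache.get(run_key(stage["cmd"], stage["params"], dep_hashes))
        if outs is None:
            report[stage["name"]] = "must run"
            continue
        known.update(outs)
        on_remote = all(h in remote_hashes for h in outs.values())
        report[stage["name"]] = (
            "cached" if on_remote else "cached, data missing on remote"
        )
    return report


stages = [
    {"name": "stage1", "cmd": "python stage1.py",
     "params": {"parameter_name": 123}, "deps": []},
    {"name": "stage2", "cmd": "python stage2.py", "params": {}, "deps": ["1.txt"]},
]
# The run cache only knows about stage1 with parameter_name=123.
run_cache = {
    run_key("python stage1.py", {"parameter_name": 123}, {}): {"1.txt": "aaa"}
}
print(resolve_from_run_cache(stages, run_cache, {"aaa"}))
# {'stage1': 'cached', 'stage2': 'must run'}
```

Nothing is written anywhere in this sketch; it only reads the lightweight run-cache records and the set of hashes present on the remote.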

Here is an example of the dry run failing with both dvc.lock and params.yaml present:

git clone https://github.com/Danila89/dvc_empty.git && cd dvc_empty && git pull --all && git checkout dvc_status_issue_1 && dvc exp run -n something --dry

Is this still expected?

Yes, try it with dvc exp run -n something --dry --allow-missing and it will succeed.

Dec 07 '23 19:12 dberenbaum

Yes, try it with dvc exp run -n something --dry --allow-missing and it will succeed.

Yep, it does not throw an error now, but the output is not very useful. For example, if I run dvc exp run -n something --dry --allow-missing -S parameters.parameter_name=some_new_value, I expect all stages to be marked as changed because (dvc.yaml here):

For stage1 the params changed, which means that its output (1.txt) will also change.
Since 1.txt changes, stage2 should also be marked as changed, which causes 2.txt to change.
Since 2.txt changes, stage3 should also be marked as changed.

However, the dry run reports that stage1 changed, while the others did not.
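The propagation I expect can be written down as a small sketch. The PIPELINE dictionary is a hypothetical model of the three stages described above (the outputs of stage3 are left empty because they are not relevant here), and propagate_changes is an illustrative name, not a DVC function.

```python
from collections import deque

# Hypothetical model of the pipeline: each stage lists its deps and outs.
PIPELINE = {
    "stage1": {"deps": [], "outs": ["1.txt"]},
    "stage2": {"deps": ["1.txt"], "outs": ["2.txt"]},
    "stage3": {"deps": ["2.txt"], "outs": []},
}


def propagate_changes(pipeline, directly_changed):
    """Mark every stage that consumes an output of a changed stage as changed too."""
    changed = set(directly_changed)
    queue = deque(directly_changed)
    while queue:
        stage = queue.popleft()
        dirty_outs = set(pipeline[stage]["outs"])
        for name, spec in pipeline.items():
            # A stage becomes changed once any of its deps is a dirty output.
            if name not in changed and dirty_outs & set(spec["deps"]):
                changed.add(name)
                queue.append(name)
    return changed


# Only stage1 is directly affected by the -S override.
print(sorted(propagate_changes(PIPELINE, ["stage1"])))
# ['stage1', 'stage2', 'stage3']
```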

What we need is better support for the run cache. Theoretically, it could be possible to: (...)

In general, it sounds good, but there are some parts that I don't understand completely.

for any dvc exp run -S command, try to get as far as possible using the run cache without actually running the stages

It should always be possible to get to the end of the pipeline and mark every stage as either changed or unchanged, shouldn't it?

only makes updates that are found in the run cache

Could you please elaborate on the updates it will make? In the ideal scenario, it should just output the info to the console without making any updates anywhere.

Dec 07 '23 23:12 Danila89