Exp run --dry: command fails with missing files from the dependencies
Description
I'm using dvc with hydra and I would like to check whether my experiment run will reuse caches of some stages outputs.
The dvc exp run --dry command fails when some of the dependencies are missing. dvc status could do the job, but the problem with it is that it cannot emulate dvc exp run -n exp3 -S parameters.parameter_name=123.
I can use hydra to compose params.yaml, but it requires some additional manual work.
Maybe it's worth allowing dvc status to accept -S overrides for hydra and to compile params.yaml inside the dvc status command?
Reproduce
git clone https://github.com/Danila89/dvc_empty.git && cd dvc_empty && git pull --all && git checkout dvc_status_issue && dvc exp run -n something --dry
Expected
Results of dry-run, indicating that I have missing dependencies for all the stages except stage1
Environment information
Output of dvc doctor:
(base) danila.savenkov@RS-UNIT-0099 dvc_empty % dvc doctor
DVC version: 3.30.3 (pip)
-------------------------
Platform: Python 3.10.9 on macOS-13.3.1-arm64-arm-64bit
Subprojects:
dvc_data = 2.22.3
dvc_objects = 1.3.0
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.4.1
Supports:
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
s3 (s3fs = 2023.5.0, boto3 = 1.26.76)
Config:
Global: /Users/danila.savenkov/Library/Application Support/dvc
System: /Library/Application Support/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/64bbbded2e55036b006c56ceaefa98e1
I'm using dvc with hydra and I would like to check whether my experiment run will reuse caches of some stages outputs. The dvc exp run --dry command fails when some of the dependencies are missing. dvc status could do the job, but the problem with it is that it cannot emulate dvc exp run -n exp3 -S parameters.parameter_name=123.
How does changing the parameters with -S affect whether the experiment will reuse cached data?
May be it's worth to allow dvc status to accept -S overrides for hydra and to compile params.yaml inside dvc status command?
I'm not sure we can fit it into dvc status, but maybe we can have a way to compile params.yaml in dvc exp run without actually running the experiment. Would that help?
How does changing the parameters with -S affect whether the experiment will reuse cached data?
As you can see here, stage1 depends on parameters.parameter_name. Passing parameters.parameter_name that is different from the cached run's parameters.parameter_name will invalidate the cache.
I'm not sure we can fit it into dvc status, but maybe we can have a way to compile params.yaml in dvc exp run without actually running the experiment. Would that help?
I believe just params.yaml compilation will not help. Dry run in this case fails both with and without params.yaml.
In my opinion, dry run should consider the outputs as changed when any of (cmd, deps, params) changed and as unchanged otherwise. And for sure it should not fail because of some missing outputs.
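The rule proposed above can be sketched roughly as follows. This is only an illustration of the requested behavior, not how DVC implements dry runs; `StageState` and `changed_fields` are hypothetical names, and the hashes and values are made up for the example. Note that the outputs never enter the comparison, so a missing output file cannot cause a failure.

```python
from dataclasses import dataclass, field


@dataclass
class StageState:
    """Hypothetical snapshot of the inputs that define a stage."""
    cmd: str
    deps: dict = field(default_factory=dict)    # path -> content hash
    params: dict = field(default_factory=dict)  # param name -> value


def changed_fields(current, recorded):
    """Return which of (cmd, deps, params) differ from the recorded state.

    `recorded` is None when no previous state is known (for example,
    no dvc.lock entry); in that case everything counts as changed.
    """
    names = ("cmd", "deps", "params")
    if recorded is None:
        return list(names)
    return [n for n in names if getattr(current, n) != getattr(recorded, n)]


recorded = StageState(
    cmd="python stage1.py",
    deps={"stage1.py": "abc"},
    params={"parameters.parameter_name": 1},
)
current = StageState(
    cmd="python stage1.py",
    deps={"stage1.py": "abc"},
    params={"parameters.parameter_name": 123},
)
print(changed_fields(current, recorded))
```

A stage would be reported as changed whenever the returned list is non-empty, and as unchanged otherwise.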
Sorry folks, I haven't been following the whole discussion. So, to confirm, even before we discuss Hydra and params.yaml, there is another issue with dvc exp run --dry in a simpler scenario. There is no way to tell it to consider stages with missing dependencies as unchanged, is that correct?
Then my question is (probably you have discussed this already) - why is --dry --allow-missing not possible?
Or is it about making it a default behavior in --dry mode?
Then my question is (probably you have discussed this already) - why is --dry --allow-missing not possible?
@shcheklein --dry --allow-missing works but not in the above example because there is no dvc.lock file in the repo. A non-dry run will also ignore --allow-missing here for the same reason.
@Danila89 This is why I mentioned in email that --allow-missing currently is not doing anything for you if you keep no version of dvc.lock. Similarly, it is expected (although probably not clear from the output) that dry run fails at the 2nd stage because it is missing 1.txt, which is a dependency of that stage. Without dvc.lock or actually running the stage, dry run does not know what version of 1.txt to expect.
Instead, I was suggesting a workaround where you could:
- Compile params.yaml without running the experiment (doesn't exist but theoretically would be easy to add)
- Run dvc status using that compiled params.yaml to get the info you want
Overall, AFAIU you want to use the run cache to check if any cached version of each stage exists and to restore that stage. We have discussed it before but have not done that with --dry-run or --allow-missing because it could be surprising behavior. It will end up creating or modifying entries in dvc.lock for matches that it finds, which might be unexpected for those flags. I'm happy to discuss it, but that's why it doesn't work like you expect today.
Instead, I was suggesting a workaround where you could:
- Compile params.yaml without running the experiment (doesn't exist but theoretically would be easy to add)
- Run dvc status using that compiled params.yaml to get the info you want
This will work in the current scenario, although it is not very convenient that I have to write a python script for p.1.
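For illustration, the core of such a script could look like the sketch below. It is a simplified stand-in and not Hydra: it only handles plain `key.path=value` overrides on a nested dict, and the helper names `parse_value` and `apply_overrides` are hypothetical. A real script would use Hydra to compose the config and a YAML library to write the result to params.yaml before running dvc status.

```python
import copy


def parse_value(raw):
    """Best-effort conversion of an override value (int, float, bool, str)."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    return raw


def apply_overrides(params, overrides):
    """Return a copy of `params` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(params)
    for item in overrides:
        dotted, _, raw = item.partition("=")
        *parents, leaf = dotted.split(".")
        node = result
        for key in parents:
            # Create intermediate mappings when they are missing.
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw)
    return result


base = {"parameters": {"parameter_name": 1, "other": "x"}}
compiled = apply_overrides(base, ["parameters.parameter_name=123"])
print(compiled)
```

The compiled dictionary is what would be written to params.yaml; the original `base` is left untouched so the two can be compared.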
But the next thing that will be pretty important for me is the ability to check whether the remote cache can be reused for dvc exp run --pull command. Currently dvc status is incapable of doing this, I'm wondering if dvc exp run --pull --dry is. Perhaps as a workaround I could pull all the remote cache locally (if it is possible with dvc) and after that use dvc status as you mentioned, but I would prefer a solution without massive cache pulling.
Without dvc.lock or actually running the stage, dry run does not know what version of 1.txt to expect.
Here is an example of dry run fail with both dvc.lock and params.yaml present:
git clone https://github.com/Danila89/dvc_empty.git && cd dvc_empty && git pull --all && git checkout dvc_status_issue_1 && dvc exp run -n something --dry
Is it still expected?
This will work in the current scenario, although it is not very convenient that I have to write a python script for p.1.
Sorry, I meant if we implemented a way to do this from DVC. However, I agree it's a bit of a hack, so not sure if this is the best solution, although maybe it's a useful debugging tool.
But the next thing that will be pretty important for me is the ability to check whether the remote cache can be reused for dvc exp run --pull command. Currently dvc status is incapable of doing this, I'm wondering if dvc exp run --pull --dry is.
Right, I don't think this will work as you would want today. What we need is better support for the run cache. Theoretically, it could be possible to:
- pull only the run-cache, which is a lightweight record of the deps and outs of every run
- for any dvc exp run -S command, try to get as far as possible using the run cache without actually running the stages
- check if all data of that data exists on the remote
Maybe we could support something like --only-run-cache or --dry-run-cache that isn't quite the same as a dry run but only makes updates that are found in the run cache (so never executes any stages).
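To make the idea more concrete, here is a rough sketch of what such a run-cache-only pass might do. Everything in it is an assumption for illustration: the key format, the `run_cache` mapping, the `remote_objects` set, and the function names do not correspond to DVC's actual run-cache layout or API.

```python
import hashlib
import json


def run_key(cmd, dep_hashes, params):
    """Hypothetical run-cache key built from a stage's cmd, deps and params."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def dry_run_cache(stages, params, run_cache, remote_objects, source_hashes=None):
    """Walk stages in pipeline order using only run-cache records.

    Nothing is executed and nothing is written; the result is a report.
    """
    known = dict(source_hashes or {})  # path -> hash resolved so far
    report = {}
    for stage in stages:
        if any(dep not in known for dep in stage["deps"]):
            # An upstream stage has no cached record, so this stage's
            # input hashes cannot be determined without running it.
            report[stage["name"]] = "unknown (upstream would run)"
            continue
        dep_hashes = {d: known[d] for d in stage["deps"]}
        stage_params = {p: params[p] for p in stage["params"]}
        outs = run_cache.get(run_key(stage["cmd"], dep_hashes, stage_params))
        if outs is None:
            report[stage["name"]] = "would run"
            continue
        known.update(outs)
        if all(h in remote_objects for h in outs.values()):
            report[stage["name"]] = "cached on remote"
        else:
            report[stage["name"]] = "cached, data missing on remote"
    return report


stages = [
    {"name": "stage1", "cmd": "c1", "deps": [], "params": ["p"]},
    {"name": "stage2", "cmd": "c2", "deps": ["1.txt"], "params": []},
]
run_cache = {
    run_key("c1", {}, {"p": 1}): {"1.txt": "h1"},
    run_key("c2", {"1.txt": "h1"}, {}): {"2.txt": "h2"},
}
print(dry_run_cache(stages, {"p": 1}, run_cache, {"h1", "h2"}))
print(dry_run_cache(stages, {"p": 2}, run_cache, {"h1", "h2"}))
```

In this sketch, a stage downstream of one that would actually run is reported as unknown, because the hash of its input is not available until the upstream stage has produced it.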
Here is an example of dry run fail with both dvc.lock and params.yaml present:
git clone https://github.com/Danila89/dvc_empty.git && cd dvc_empty && git pull --all && git checkout dvc_status_issue_1 && dvc exp run -n something --dry
Is it still expected?
Yes, try it with dvc exp run -n something --dry --allow-missing and it will succeed.
Yes, try it with dvc exp run -n something --dry --allow-missing and it will succeed.
Yep, it does not throw an error now, but the output is not very useful. For example if I run dvc exp run -n something --dry --allow-missing -S parameters.parameter_name=some_new_value I expect that all stages will be marked as changed because (dvc.yaml here):
- For stage1 the params changed, which means that the output (1.txt) will also change
- Since 1.txt is changed, stage2 should also be changed, which causes 2.txt to be changed
- Since 2.txt is changed, stage3 should also be changed
However the dry-run claims that stage1 changed, while others did not.
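The expectation described in the list above amounts to propagating "changed" status downstream through the pipeline. A minimal sketch follows, assuming the stage layout implied by this discussion (stage1 produces 1.txt, stage2 consumes it and produces 2.txt, stage3 consumes 2.txt); the outputs of stage3 are not stated in the thread and are left empty here, and `propagate_changes` is a hypothetical helper, not a DVC function.

```python
def propagate_changes(stages, directly_changed):
    """Mark a stage as changed if it changed itself or consumes a dirty output.

    `stages` is a list of (name, deps, outs) tuples in pipeline order.
    """
    changed = []
    dirty_outputs = set()
    for name, deps, outs in stages:
        if name in directly_changed or dirty_outputs.intersection(deps):
            changed.append(name)
            dirty_outputs.update(outs)
    return changed


# Layout assumed from the discussion, not copied from dvc.yaml.
PIPELINE = [
    ("stage1", [], ["1.txt"]),
    ("stage2", ["1.txt"], ["2.txt"]),
    ("stage3", ["2.txt"], []),
]

print(propagate_changes(PIPELINE, {"stage1"}))
```

This propagation is deliberately pessimistic: it assumes that a changed stage always produces different outputs, which cannot be verified without running the stage or finding a matching cached run.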
What we need is better support for the run cache. Theoretically, it could be possible to: (...)
In general, it sounds good, but there are some parts that I don't understand completely.
for any dvc exp run -S command, try to get as far as possible using the run cache without actually running the stages
It should be always possible to get until the end of the pipeline and mark all the stages as either changed or didn't change, shouldn't it?
only makes updates that are found in the run cache
Could you please elaborate more on the updates it will make? In the ideal scenario it should just output the info to the console without making any updates anywhere