dvc checkout --allow-missing / --skip-frozen / --exclude
dvc checkout is very convenient, especially when called as part of the post-checkout git hook. In fact, this is fairly crucial to keeping the git state aligned with the dvc state.
But consider the following scenario. Raw data is imported from one repo (a registry) into another (a dataset). Both repos share a cache, so the import operation is almost instantaneous, even though the data is very big. Next, this raw data is processed in the dataset repo, and the stage that does the processing is frozen. Later, the raw data that was previously imported is cleaned out of the cache to make space.
Now dvc checkout in the dataset repo will fail, because some of the data is no longer available in the cache. We could always dvc pull it again, but it was cleaned out of the cache for a reason! We are done with it, downloading it from the remote would take forever, and the cache volume is full anyway.
It would be nice if dvc checkout made a "best effort" -- checking out all the data that it can find and throwing a warning about the data that it can't (and perhaps suggesting that we dvc pull the data if we really want it).
Alternatively, it would be helpful if dvc checkout (or even just dvc git-hook) had one or more options for dealing with this situation more gracefully. For example:
- --allow-missing: turn on the "best effort" behaviour above
- --skip-frozen: don't even try to checkout deps of frozen stages
- --exclude: specifically exclude certain files or directories from the checkout
I think "best effort" as the default behaviour would save a lot of grief, particularly when called by a hook. Yes, there is a risk of the dvc and git state getting out of sync, but so long as the user is properly informed, that is fine by me.
For the record, --allow-missing was already supported internally but I am not sure why it was not exposed to the CLI. Opened https://github.com/iterative/dvc/pull/9919
That's good news. Thanks @daavoo
Once this is incorporated, it would be good to be able to use it in the post-checkout hook, either by default, by editing the hook itself, or perhaps through a config option.
Great that --allow-missing is now exposed.
Is this the default behaviour? Or does the flag need to be specified?
How about the post-checkout hook? Do we need to modify it to allow missing files? What would we need to change?
Is this the default behaviour? Or does the flag need to be specified?
You need to specify it.
How about the post-checkout hook? Do we need to modify it to allow missing files? What would we need to change?
Unfortunately, the current git hook just calls plain checkout with no flags:
https://github.com/iterative/dvc/blob/f98150086254ae271145b1d3307b75d32ae3cb20/dvc/commands/git_hook.py#L34-L55
Either you need to write your own or we would need to add support for passing flags to the hook.
A simple one that doesn't account for some things:
$ cat .git/hooks/post-checkout
#!/bin/sh
exec dvc checkout --allow-missing
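As a rough sketch of what a slightly more careful hand-written hook could look like: git passes three arguments to post-checkout (previous HEAD, new HEAD, and a flag that is 1 for a branch checkout and 0 for a file checkout), and the linked git_hook.py appears to skip file checkouts and in-progress rebases. The version below tries to mirror those two checks before calling dvc checkout --allow-missing. The function name post_checkout is made up for illustration, and the script assumes .git is a regular directory in the repo root (so it does not cover worktrees or submodules).

```shell
#!/bin/sh
# Hypothetical replacement for .git/hooks/post-checkout (a sketch, not DVC's own hook).
# Git invokes this hook as: post-checkout <prev-head> <new-head> <flag>

post_checkout() {
    # Flag "1" means a branch/ref checkout; anything else (e.g. "0" for a
    # single-file checkout) should not trigger a DVC checkout.
    [ "${3:-}" = "1" ] || return 0

    # Do nothing in the middle of a rebase, so a failing checkout cannot
    # interrupt it. Assumes the hook runs in the repo root and that .git
    # is a plain directory.
    if [ -d .git/rebase-merge ]; then
        return 0
    fi

    # Best-effort checkout: data missing from the cache is skipped
    # instead of making the whole checkout fail.
    dvc checkout --allow-missing
}

post_checkout "$@"
```

The file has to be executable (chmod +x .git/hooks/post-checkout) for git to run it, and it would replace the hook that dvc install writes, so re-running dvc install may need care.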
Thanks @daavoo!!
This issue is resolved by the merge of #9919. Could the maintainers please close it? cc: @daavoo