guide: Data Management
UPDATE: https://github.com/iterative/dvc.org/issues/2856#issuecomment-1278366397
This is the plan for a data management trail that focuses on:
- Adding data to DVC projects & Versioning data in DVC projects
- The cache (local)
- & Shared cache (external)
- Removing data from DVC projects
- + Creating remotes (link to config/remotes)
- & Sync with remotes. See also https://github.com/iterative/dvc.org/issues/2866
- Accessing public datasets and data registries (get, import)
- External data topics (See https://github.com/iterative/dvc.org/issues/563#issuecomment-1198687900)
- probably address https://github.com/iterative/dvc.org/issues/520
Adding data to DVC projects
- Initialize a DVC repository and use `dvc add` to add files.
- We'll assume MNIST data exists in a folder and will add it.
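A minimal sketch of these two steps, assuming Git and DVC are installed. The `data/mnist` folder and its placeholder file are hypothetical stand-ins for a real MNIST download, and the Git identity is a dummy value used only so the commit succeeds.

```shell
# Work in a throwaway directory so no existing project is touched.
cd "$(mktemp -d)"
git init -q
dvc init -q

# Hypothetical stand-in for a folder that already holds the MNIST files.
mkdir -p data/mnist
echo "placeholder for MNIST images" > data/mnist/train-images.bin

# Track the whole folder; DVC writes data/mnist.dvc and adds the folder
# to data/.gitignore so Git does not pick up the data itself.
dvc add data/mnist

# Git versions only the small pointer file, not the data.
git add data/mnist.dvc data/.gitignore
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Track MNIST folder with DVC"
```

No project code is needed here; the only inputs are the files to be tracked.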
Versioning data in DVC projects
- Overwrite Fashion-MNIST data on top of MNIST and update the dataset.
- Go back and forth in Git history to get different datasets in the same folder.
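The two bullets above could look roughly like this. It is a sketch under stated assumptions: the `data/images` folder, the placeholder file contents, and the dummy Git identity are all hypothetical, and real datasets would replace the `echo` lines.

```shell
cd "$(mktemp -d)"
git init -q
dvc init -q
git config user.name "Example"
git config user.email "example@example.com"

# Version 1: a placeholder standing in for MNIST.
mkdir -p data/images
echo "MNIST placeholder" > data/images/train.bin
dvc add -q data/images
git add data/images.dvc data/.gitignore
git commit -q -m "Add MNIST"

# Version 2: overwrite the same folder with a Fashion-MNIST placeholder.
echo "Fashion-MNIST placeholder" > data/images/train.bin
dvc add -q data/images
git commit -q -am "Switch to Fashion-MNIST"

# Go back: Git restores the old pointer file, DVC restores the matching data.
git checkout -q HEAD~1
dvc checkout -q
old_version="$(cat data/images/train.bin)"

# And forth again: return to the latest commit and sync the folder.
git checkout -q -
dvc checkout -q
new_version="$(cat data/images/train.bin)"
```

The folder path never changes; which dataset it holds depends only on which commit's `.dvc` file is checked out.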
Creating remotes
- Add a Google Drive folder as a remote.
- Make it the default.
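As a rough sketch of the remote setup: the remote name `gdrive_store` and the folder ID below are hypothetical placeholders, and the Git identity is a dummy value. This only writes configuration; actually transferring data to Google Drive additionally needs DVC's Google Drive support installed and an authenticated account, which is not shown.

```shell
cd "$(mktemp -d)"
git init -q
dvc init -q

# Hypothetical Google Drive folder ID; use the ID from your own folder's URL.
folder_id="0AIac4JZqHhKmUk9PDA"

# Register the folder as a remote. The name "gdrive_store" is arbitrary.
dvc remote add gdrive_store "gdrive://$folder_id"

# Make it the default so later push/pull calls need no explicit remote.
dvc remote default gdrive_store

# The settings live in .dvc/config, which is versioned with Git.
dvc remote list
git add .dvc/config
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Configure Google Drive remote"
```

The two configuration steps can also be combined with `dvc remote add -d`.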
Pushing to/pulling from remotes
- Push the cache to the remote we created
- Clone the repository to somewhere (e.g. ssh or local folder)
- Pull the cache
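The push, clone, and pull round trip can be sketched as follows. To keep the example runnable offline it swaps in a local directory as the remote instead of the Google Drive folder from the plan; the directory layout, file name, and Git identity are hypothetical.

```shell
work="$(mktemp -d)"
mkdir "$work/storage" "$work/project"
cd "$work/project"
git init -q
dvc init -q
git config user.name "Example"
git config user.email "example@example.com"

# Track a placeholder dataset.
echo "dataset placeholder" > data.bin
dvc add -q data.bin

# Local directory used as the default remote (stand-in for Google Drive).
dvc remote add -d storage "$work/storage"
git add data.bin.dvc .gitignore .dvc/config
git commit -q -m "Track data and configure remote"

# Upload the cached data to the remote.
dvc push -q

# Clone the Git repository elsewhere: the clone has the pointer file only.
git clone -q "$work/project" "$work/clone"
cd "$work/clone"

# Download the data that the pointer file refers to.
dvc pull -q
```

After the clone, `data.bin` is absent until `dvc pull` fetches it, which is the point this step of the trail would make.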
Accessing public datasets and registries
- Get the Fashion MNIST data from dataset-registry
Removing data from DVC projects
- Remove certain folders from workspace
- Delete the corresponding cache files
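One possible shape for the removal steps, assuming a folder `data/old` (a hypothetical name) that was tracked earlier. The Git identity is again a dummy value.

```shell
cd "$(mktemp -d)"
git init -q
dvc init -q
git config user.name "Example"
git config user.email "example@example.com"

# Set up a tracked folder that we will drop again.
mkdir -p data/old
echo "obsolete placeholder" > data/old/file.bin
dvc add -q data/old
git add data/old.dvc data/.gitignore
git commit -q -m "Track a folder we will later drop"

# Stop tracking: deletes data/old.dvc and the matching .gitignore entry.
dvc remove data/old.dvc

# dvc remove leaves the workspace copy in place, so delete it explicitly.
rm -rf data/old
git add -A
git commit -q -m "Stop tracking data/old"

# Delete cache entries that the current workspace no longer references.
# -w limits the check to the workspace; -f skips the confirmation prompt.
dvc gc -w -f
```

Note that `dvc gc -w` also discards cached data that only older commits refer to, so it should be used with care in a project whose history matters.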
UPDATE: start with a reorg, see https://github.com/iterative/dvc.org/issues/2856#issuecomment-1110477162 below (may be enough).
👍
The only thing I can think of that you might want to add is import/update.
Actually, `import/update` might not be necessary since this is GS. In fact, I think you could probably stop the GS doc after "Pushing to/pulling from remotes."
Let me try to write something with the above plan and we can see what's extra. I think "public repository access" may be a good selling point. I believe these documents are like "brochures" and some kind of advertisement. I'm just trying to touch on as many points as possible without too many technicalities. @dberenbaum
I think it's better to discuss the scope of the update here @shcheklein @dberenbaum @jorgeorpinel
- I'll remove `gc` and `remove` from the document
- Should I keep `example-get-started` intact? I'd use an empty repository and `dvc add` a file downloaded with `wget`. If I remove `dvc import` and code related tracking, I think this approach requires no code to be downloaded. People will just `dvc init` and start tracking their data.
- The current titles include `model` keyword, should I keep it or not? Data and Model Management may be too broad but I think models are also data.
> Should I keep example-get-started intact? I'd use an empty repository and dvc add a file downloaded with wget.
sounds good as a first step. I would focus specifically on restructuring the existing sections. A good first step can be as simple as moving all of them under the Data Management Trail and figuring out the redirects, names, intro part on the index page, etc
> If I remove dvc import and code related tracking
could you clarify please?
> model keyword, should I keep it or not?
Good question. Pros are that a bigger title could resonate with a bit more people. What are the downsides? I don't have a strong opinion on this. I would still try to keep in the text models as much as it makes sense to keep highlighting that models could be managed this way. There should be a feeling that it fits into GitOps for models (w/o data) scenario very well.
> I would focus specifically on restructuring the existing sections.

From my experience with the experiments doc, restructuring usually takes more time and effort than a rewrite. The flow of the current documents assumes the user will have a project. I don't want to assume this in the Data Management Trail, the user might be looking to track their music collection with Git. We don't need code, we don't need Python, we don't need anything except a bunch of files to explain the data management features of DVC. (Even Git is optional I believe, I would like to have a "how to add your photos and send to your friend in a DVC repository" kind of project rather than "data science", etc.)
> If I remove dvc import and code related tracking,

I don't think we need a project to download/clone from somewhere to explain data management. The current docs begin by downloading code.zip. Data management should start with data, and end with it.
> I would still try to keep in the text models as much as it makes sense to keep highlighting that models could be managed this way.
I think we can have a separate "Model Management Trail" adding experiments and MLEM to the mix. Most of the features are here, we'll have links to sections in Data and Experiments Trails' sections. "Data and Model management" can also have a section about how models are actually like data files. However, from the pointer Jorge shared in the PR, I think it may be better to call this "Data Management" and write another "Model management" document specifically about creating models with experimentation and tracking them with DVC. We already have these features, it will just be another narrative.
To be more specific:
restructuring == move/rename existing sections. Don't change anything else. E.g. create a new section "Data Management Trail" that has inside it exactly the same sections that we have now.
> The flow of the current documents assumes the user will have a project. I don't want to assume this in the Data Management Trail
we can get to this later. I think experiments is a bigger priority still. Or even getting pipelines out of data.
> the user might be looking to track their music collection with Git.
We can discuss this when we get there, but it's an open question. My personal take for now, we should stay close to ML examples in the regular data management trail. We can mention that non-ML specific scenarios could be covered.
Also, keep in mind - we can't cover everything in the Get started. We should prioritize the most critical scenario, the most critical path, the most common set of commands. This is get started (aka quick start), not UG.
> The current docs begin by downloading code.zip. Data management should better start with data, and end with it.
Current docs begin with:
dvc init
git commit
dvc add data.xml
Very precise focus, everything else comes secondary.
Or maybe I misunderstood which documents you mean?
Intention was to hide everything ... even code download happens in the expandable section in the third section or something.
> I think we can have a separate "Model Management Trail" adding experiments and MLEM to the mix.

Maybe. I would come to that after experiments, after the basic data management restructuring, and when we have MLEM a bit more complete.
re title I'd just avoid the phrase "model management" since it has a specific meaning (not what this is about) but if you want to include "model" maybe use "data and model file management". I don't think we need to include "model" in the title but "model file(s)" can be included in the content.
> since it has a specific meaning

I'm not sure we have it written somewhere? :) What kind of meaning do you have in mind? What is so different between model management and data management?
(I can see for example that if we include model management we could keep some parts about metrics for example - which is also fine)
Like, related to the ML model lifecycle? I mentioned this in the PR (https://www.dominodatalab.com/solutions/model-management/) and it wasn't contested so I assumed I was correct but you guys are the experts! If it doesn't have a special meaning then it doesn't matter. But if it does, users and search engines could get confused.
I'm not an expert on naming things :) I put "model" to the title because the current docs have it, and we put "model" after a user requested it. I understand the meaning described in https://www.dominodatalab.com/solutions/model-management/ and how it differs from the way we use it, but in this new domain usually people use the same words to mean different things.
I have no strong opinion here, and honestly writing a specific "model management" document to the UG might be more appropriate. But until then, we can have "model" in the title and we can say that "models are files that can be tracked by DVC" in the text.
Looks like we already have a GS page for basic data management in https://dvc.org/doc/start/data-and-model-versioning. Can you recap what the main differences are with this proposal @iesahin ? Thanks
If nothing major, we should probably reuse whatever is still relevant here to plan for a new Data Management GUIDE instead. Rel. https://github.com/iterative/dvc.org/issues/144#issuecomment-844711839
Data Mgmt is simple enough that covering it in the Get Started and Command Reference has thus far been enough. But having a group of existing content under this "Category" could achieve some goals:
- [x] Help reorg existing content (e.g. Large Dataset Optimization, Managing External Data, etc.)
- [ ] Include Remote usage details? See https://github.com/iterative/dvc.org/issues/2866 Or should that be part of a Configuration guide? (rel. https://github.com/iterative/dvc.org/issues/340)
So just doing that reorg of existing content could be a good and quick first step, I think. Then we reconsider all the material proposed above. WDYT?
> re Help reorg existing content
Based on the OP here's what we currently have for all the topics mentioned:
Adding data to DVC projects & Versioning data in DVC projects
- Covered in https://dvc.org/doc/use-cases/versioning-data-and-models
- and intro-ed in https://dvc.org/doc/start/data-management/data-versioning
The cache (local)
- https://dvc.org/doc/user-guide/project-structure/internal-files and
- https://dvc.org/doc/user-guide/data-management/large-dataset-optimization#configuring-dvc-cache-file-link-type + https://dvc.org/doc/command-reference/config#cache
& Shared cache (external)
- https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache + https://dvc.org/doc/command-reference/gc#cleaning-shared-cache-or-remote
- https://dvc.org/doc/user-guide/data-management/managing-external-data#setting-up-an-external-cache
Removing data from DVC projects
- https://dvc.org/doc/user-guide/how-to/stop-tracking-data + https://dvc.org/doc/command-reference/remove, https://dvc.org/doc/command-reference/gc
Creating remotes (link to config/remotes) & Sync with remotes
- Intro-ed in https://dvc.org/doc/start/data-management/data-versioning#storing-and-sharing
- More info. in refs under https://dvc.org/doc/command-reference/remote
Accessing public datasets and data registries (get, import)
- Intro-ed in https://dvc.org/doc/start/data-management/data-and-model-access
- Also covered in https://dvc.org/doc/use-cases/data-registry
External data topics
- https://dvc.org/doc/user-guide/data-management/importing-external-data
- https://dvc.org/doc/user-guide/data-management/managing-external-data + in refs e.g. https://dvc.org/doc/command-reference/add#example-transfer-to-remote-storage
Given that all that is already covered (albeit maybe disorganized) and not really the general goal of the UG (explanation-type docs), here's a new plan for the Data Management user guide:
- [x] Motivation: Why do you need to manage data better (using DVC and its Git-based approach). Maybe use a diagram (related to this one). https://github.com/iterative/dvc.org/pull/4042
- [x] Form a perspective: Give a clear overview of the pieces involved: workspace, Git, cache, remote storage, + basic and sync operations (`add`, `remove`, `push`, `pull`, etc.)? https://github.com/iterative/dvc.org/pull/4053
- Secondary: Maybe concentrate info on some of these parts and features from existing docs (check refs, how-tos, etc.)
- Idea: Emphasize the value by contrasting typical/ad hoc methods vs DVC project structures (before/after).
  Currently people may struggle to understand the difference. Note: don't overcomplicate this though (shouldn't look scary).
- [x] Explain remote storage properly (scenarios, setup, sync, cleanup, etc.) Inc. https://github.com/iterative/dvc.org/issues/2866 https://github.com/iterative/dvc.org/pull/4058
- [ ] Other applications i.e. shared cache, external data, etc. Inc. https://github.com/iterative/dvc.org/issues/520
> Idea: Emphasize the value by contrasting typical/ad hoc methods vs DVC project structures (before/after)
My only problem with this idea is that we should drive the value of the product and feature earlier than the user guide. This should be in use cases or in the Get Started if needed, even in README as well. OK to repeat in the UG as well, but as a quick recap. @shcheklein