dvc.org guide: Data Management

UPDATE: https://github.com/iterative/dvc.org/issues/2856#issuecomment-1278366397

This is the plan for a data management trail that focuses on:

Adding data to DVC projects & Versioning data in DVC projects
The cache (local)
& Shared cache (external)
Removing data from DVC projects
+ Creating remotes (link to config/remotes)
& Sync with remotes (see also https://github.com/iterative/dvc.org/issues/2866)
Accessing public datasets and data registries (get, import)
External data topics (See https://github.com/iterative/dvc.org/issues/563#issuecomment-1198687900)
- probably address https://github.com/iterative/dvc.org/issues/520

Adding data to DVC projects

Initialize a DVC repository and use dvc add to add files.
We'll assume the MNIST data already exists in a folder and add that folder.
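
A minimal sketch of what this section could walk through. The helper name track_data and the folder name mnist are hypothetical, and the sketch assumes the folder sits at the project root with git and dvc installed:

```shell
# Hypothetical helper: put an existing top-level data folder under DVC control.
# Assumes it is run from the project root and that git and dvc are installed.
track_data() {
  data_dir="$1"                       # e.g. mnist (no trailing slash)
  git init                            # DVC works on top of a Git repository by default
  dvc init                            # creates .dvc/ and stages it in Git
  dvc add "$data_dir"                 # caches the data and writes $data_dir.dvc
  git add "$data_dir.dvc" .gitignore  # Git only tracks the small pointer file
  git commit -m "Track $data_dir with DVC"
}
```

Usage would be something like track_data mnist. Keeping everything in one function is only for illustration; the guide would probably show the commands one at a time.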

Versioning data in DVC projects

Overwrite the MNIST data in that folder with Fashion-MNIST and update the tracked dataset.
Go back and forth in Git history to get different datasets in the same folder.
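
One way this could look, as a sketch only. The helper names update_dataset and restore_dataset are hypothetical, and it assumes the folder is already tracked through a .dvc file next to it:

```shell
# Hypothetical helper: record a new version of an already tracked folder.
update_dataset() {
  data_dir="$1"
  message="$2"
  dvc add "$data_dir"        # re-hash the contents and rewrite $data_dir.dvc
  git add "$data_dir.dvc"
  git commit -m "$message"   # the Git commit is what identifies the data version
}

# Hypothetical helper: bring back the folder as it was at a given Git revision.
restore_dataset() {
  data_dir="$1"
  rev="$2"
  git checkout "$rev" -- "$data_dir.dvc"  # restore the old pointer file
  dvc checkout "$data_dir.dvc"            # sync the workspace data with it
}
```

For example, update_dataset mnist "Switch to Fashion-MNIST" followed later by restore_dataset mnist HEAD~1 would swap the folder contents back to the previous dataset, provided that version is still in the cache.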

Creating remotes

Add a Google Drive folder as a remote.
Make it the default remote.
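
A possible sketch for this step. The helper name add_default_remote and the remote name storage are hypothetical, and the Google Drive folder ID is a placeholder the reader has to fill in. Google Drive support is assumed to be installed as an extra, e.g. pip install "dvc[gdrive]":

```shell
# Hypothetical helper: register a remote and make it the project default.
add_default_remote() {
  name="$1"
  url="$2"                              # e.g. gdrive://<folder-id> (placeholder)
  dvc remote add -d "$name" "$url"      # -d marks it as the default remote
  git add .dvc/config                   # the remote is stored in the project config
  git commit -m "Add DVC remote $name"
}
```

Committing .dvc/config matters here because anyone who clones the repository later picks up the default remote from it.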

Pushing to/pulling from remotes

Push the cache to the remote we created.
Clone the repository somewhere else (e.g. over SSH or to a local folder).
Pull the cache.
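
The three steps above could be sketched as follows. The helper name push_and_verify is hypothetical, and the sketch assumes a default remote is already configured and committed, so that the clone knows where to pull from:

```shell
# Hypothetical helper: upload cached data, then check it from a fresh clone.
push_and_verify() {
  repo_url="$1"    # Git URL or local path of this project
  clone_dir="$2"   # where the fresh copy should go
  dvc push                              # upload cached data to the default remote
  git clone "$repo_url" "$clone_dir"    # the clone has .dvc files but no data yet
  ( cd "$clone_dir" && dvc pull )       # download the data referenced by the clone
}
```

The subshell around cd keeps the caller's working directory unchanged.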

Accessing public datasets and registries

Get the Fashion-MNIST data from dataset-registry.
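
A sketch of how this could be shown. The helper name get_dataset is hypothetical, and the path fashion-mnist/raw used in the usage note is an assumption about the registry layout that should be checked against the repository:

```shell
# Hypothetical helper: download a DVC-tracked path from another repository
# without creating any tracking files in the current directory.
get_dataset() {
  registry_url="$1"   # Git URL of the registry
  path="$2"           # path of the dataset inside the registry
  out="$3"            # local destination
  dvc get "$registry_url" "$path" -o "$out"
}
```

Usage might look like get_dataset https://github.com/iterative/dataset-registry fashion-mnist/raw data/fashion-mnist. Unlike dvc import, this leaves no link back to the source, which fits a quick first download.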

Removing data from DVC projects

Remove certain folders from the workspace.
Delete the corresponding cache files.
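
A rough sketch of these two steps. The helper name stop_tracking is hypothetical, and it assumes the folder is tracked by a .dvc file next to it that was committed earlier:

```shell
# Hypothetical helper: stop tracking a folder and clean up its cached data.
stop_tracking() {
  data_dir="$1"
  [ -n "$data_dir" ] || return 1        # refuse to run without a target
  dvc remove "$data_dir.dvc"            # delete the pointer file and its .gitignore entry
  rm -rf -- "$data_dir"                 # delete the workspace copy of the data
  git add "$data_dir.dvc" .gitignore    # stage the removal
  git commit -m "Stop tracking $data_dir"
  dvc gc --workspace --force            # drop cache not used by the current workspace
}
```

Note that dvc gc --workspace also discards cached data that is only referenced by older commits, so the guide should probably warn about that before showing it.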

UPDATE: start with a reorg, see https://github.com/iterative/dvc.org/issues/2856#issuecomment-1110477162 below (may be enough).

Sep 27 '21 11:09 iesahin

👍

The only thing I can think of that you might want to add is import/update.

Sep 27 '21 19:09 dberenbaum

Actually, import/update might not be necessary since this is GS. In fact, I think you could probably stop the GS doc after "Pushing to/pulling from remotes."

Let me try to write something with the above plan and we can see what's extra. I think "public repository access" may be a good selling point. I believe these documents are like "brochures", a kind of advertisement. I'm just trying to touch on as many points as possible without too many technicalities. @dberenbaum

Oct 05 '21 15:10 iesahin

I think it's better to discuss the scope of the update here @shcheklein @dberenbaum @jorgeorpinel

I'll remove dvc gc and dvc remove from the document.
Should I keep example-get-started intact? I'd use an empty repository and dvc add a file downloaded with wget. If I remove dvc import and code related tracking, I think this approach requires no code to be downloaded. People will just dvc init and start tracking their data.
The current titles include model keyword, should I keep it or not? Data and Model Management may be too broad but I think models are also data.

Oct 12 '21 11:10 iesahin

Should I keep example-get-started intact? I'd use an empty repository and dvc add a file downloaded with wget.

sounds good as a first step. I would focus specifically on restructuring the existing sections. A good first step can be as simple as moving all of them under the Data Management Trail and figuring out the redirects, names, intro part on the index page, etc

If I remove dvc import and code related tracking

could you clarify please?

model keyword, should I keep it or not?

Good question. Pros are that a bigger title could resonate with a bit more people. What are the downsides? I don't have a strong opinion on this. I would still try to keep in the text models as much as it makes sense to keep highlighting that models could be managed this way. There should be a feeling that it fits into GitOps for models (w/o data) scenario very well.

Oct 12 '21 21:10 shcheklein

I would focus specifically on restructuring the existing sections.

From my experience with the experiments doc, restructuring usually takes more time and effort than a rewrite. The flow of the current documents assumes the user will have a project. I don't want to assume this in the Data Management Trail, the user might be looking to track their music collection with Git. We don't need code, we don't need Python, we don't need anything except a bunch of files to explain the data management features of DVC. (Even Git is optional, I believe. I would like to have a "how to add your photos to a DVC repository and send them to your friend" kind of project rather than "data science", etc.)

If I remove dvc import and code related tracking,

I don't think we need a project to download/clone from somewhere to explain data management. The current docs begin by downloading code.zip. Data management should better start with data, and end with it.

I would still try to keep in the text models as much as it makes sense to keep highlighting that models could be managed this way.

I think we can have a separate "Model Management Trail" adding experiments and MLEM to the mix. Most of the features are already here, so we'd link to the relevant sections of the Data and Experiments Trails. "Data and Model management" can also have a section about how models are actually like data files. However, from the pointer Jorge shared in the PR, I think it may be better to call this "Data Management" and write another "Model management" document specifically about creating models with experimentation and tracking them with DVC. We already have these features, it will just be another narrative.

Oct 13 '21 14:10 iesahin

To be more specific:

restructuring == move/rename existing sections. Don't change anything else. E.g. create a new section "Data Management Trail" that contains exactly the same sections that we have now.

The flow of the current documents assumes the user will have a project. I don't want to assume this in the Data Management Trail

we can get to this later. I think experiments is a bigger priority still. Or even getting pipelines out of data.

the user might be looking to track their music collection with Git.

We can discuss this when we get there, but it's an open question. My personal take for now: we should stay close to ML examples in the regular data management trail. We can mention that non-ML-specific scenarios could be covered.

Also, keep in mind - we can't cover everything in the Get started. We should prioritize the most critical scenario, the most critical path, the most common set of commands. This is get started (aka quick start), not UG.

The current docs begin by downloading code.zip. Data management should better start with data, and end with it.

Current docs begin with:

dvc init
git commit
dvc add data.xml

Very precise focus, everything else comes secondary.

Or maybe I misunderstood which documents you mean?

The intention was to hide everything ... even the code download happens in an expandable section in the third section or something.

I think we can have a separate "Model Management Trail" adding experiments and MLEM to the mix.

Maybe. I would come to that after experiments, after the basic data management restructuring, and when we have MLEM a bit more complete.

Oct 13 '21 22:10 shcheklein

re title I'd just avoid the phrase "model management" since it has a specific meaning (not what this is about) but if you want to include "model" maybe use "data and model file management". I don't think we need to include "model" in the title but "model file(s)" can be included in the content.

Oct 14 '21 18:10 jorgeorpinel

since it has a specific meaning

I'm not sure we have it written somewhere? :) What kind of meaning do you have in mind? What is so different between model management and data management?

(I can see for example that if we include model management we could keep some parts about metrics for example - which is also fine)

Oct 14 '21 19:10 shcheklein

Like, related to the ML model lifecycle? I mentioned this in the PR (https://www.dominodatalab.com/solutions/model-management/) and it wasn't contested, so I assumed I was correct, but you guys are the experts! If it doesn't have a special meaning then it doesn't matter. But if it does, users and search engines could get confused.

Oct 15 '21 00:10 jorgeorpinel

I'm not an expert on naming things :) I put "model" in the title because the current docs have it, and we added "model" after a user requested it. I understand the meaning described in https://www.dominodatalab.com/solutions/model-management/ and how it differs from the way we use it, but in this new domain people often use the same words to mean different things.

I have no strong opinion here, and honestly writing a specific "model management" document for the UG might be more appropriate. But until then, we can have "model" in the title and we can say that "models are files that can be tracked by DVC" in the text.

Oct 15 '21 17:10 iesahin

Looks like we already have a GS page for basic data management in https://dvc.org/doc/start/data-and-model-versioning. Can you recap what the main differences are with this proposal @iesahin ? Thanks

If nothing major, we should probably reuse whatever is still relevant here to plan for a new Data Management GUIDE instead. Rel. https://github.com/iterative/dvc.org/issues/144#issuecomment-844711839

Mar 30 '22 08:03 jorgeorpinel

Data Mgmt is simple enough that covering it in the Get Started and Command Reference has thus far been enough. But having a group of existing content under this "Category" could achieve some goals:

[x] Help reorg existing content (e.g. Large Dataset Optimization, Managing External Data, etc.)
[ ] Include Remote usage details? See https://github.com/iterative/dvc.org/issues/2866 Or should that be part of a Configuration guide? (rel. https://github.com/iterative/dvc.org/issues/340)

So just doing that reorg of existing content could be a good and quick first step, I think. Then we reconsider all the material proposed above. WDYT?

Apr 27 '22 02:04 jorgeorpinel

re Help reorg existing content

Based on the OP, here's what we currently have for all the topics mentioned:

Adding data to DVC projects & Versioning data in DVC projects

Covered in https://dvc.org/doc/use-cases/versioning-data-and-models
and intro-ed in https://dvc.org/doc/start/data-management/data-versioning

The cache (local)

https://dvc.org/doc/user-guide/project-structure/internal-files and
https://dvc.org/doc/user-guide/data-management/large-dataset-optimization#configuring-dvc-cache-file-link-type + https://dvc.org/doc/command-reference/config#cache

& Shared cache (external)

https://dvc.org/doc/user-guide/how-to/share-a-dvc-cache + https://dvc.org/doc/command-reference/gc#cleaning-shared-cache-or-remote
https://dvc.org/doc/user-guide/data-management/managing-external-data#setting-up-an-external-cache

Removing data from DVC projects

https://dvc.org/doc/user-guide/how-to/stop-tracking-data + https://dvc.org/doc/command-reference/remove, https://dvc.org/doc/command-reference/gc

Creating remotes (link to config/remotes) & Sync with remotes

Intro-ed in https://dvc.org/doc/start/data-management/data-versioning#storing-and-sharing
More info. in refs under https://dvc.org/doc/command-reference/remote

Accessing public datasets and data registries (get, import)

Intro-ed in https://dvc.org/doc/start/data-management/data-and-model-access
Also covered in https://dvc.org/doc/use-cases/data-registry

External data topics

https://dvc.org/doc/user-guide/data-management/importing-external-data
https://dvc.org/doc/user-guide/data-management/managing-external-data + in refs e.g. https://dvc.org/doc/command-reference/add#example-transfer-to-remote-storage

Oct 13 '22 21:10 jorgeorpinel

Given that all that is already covered (albeit maybe disorganized) and not really the general goal of the UG (explanation-type docs), here's a new plan for the Data Management user guide:

[x] Motivation: Why do you need to manage data better (using DVC and its Git-based approach)? Maybe use a diagram (related to this one). https://github.com/iterative/dvc.org/pull/4042
[x] Form a perspective: Give a clear overview of the pieces involved: workspace, Git, cache, remote storage, + basic and sync operations (add, remove, push, pull, etc.)? https://github.com/iterative/dvc.org/pull/4053
Secondary: Maybe concentrate info on some of these parts and features from existing docs (check refs, how-tos, etc.)
Idea: Emphasize the value by contrasting typical/ ad hoc methods vs DVC project structures (before/after)

Currently people may struggle to understand the difference. Note: don't overcomplicate this though (it shouldn't look scary).
[x] Explain remote storage properly (scenarios, setup, sync, cleanup, etc.) Inc. https://github.com/iterative/dvc.org/issues/2866 https://github.com/iterative/dvc.org/pull/4058
[ ] Other applications i.e. shared cache, external data, etc. Inc. https://github.com/iterative/dvc.org/issues/520

Oct 14 '22 01:10 jorgeorpinel

Idea: Emphasize the value by contrasting typical/ ad hoc methods vs DVC project structures (before/after)

My only problem with this idea is that we should drive the value of the product and feature earlier than the user guide. This should be in use cases or in the Get Started if needed, or even in the README. OK to repeat in the UG too, but as a quick recap. @shcheklein

Oct 14 '22 01:10 jorgeorpinel