start: Pipelines Trail
This is the third step in the Get Started (GS) restructuring we discussed in #2496 (which may be closed once this one is addressed).
See https://github.com/iterative/dvc.org/issues/2496#issuecomment-847598646, https://github.com/iterative/dvc.org/issues/2496#issuecomment-857021652, https://github.com/iterative/dvc.org/issues/2496#issuecomment-857080772
It will introduce creating pipelines, adding stages and running them with dvc repro.
Create a pipeline
- Why do we use pipelines in DVC?
- What are dependencies?
Add a stage
- Introduce dvc stage add
Edit a stage
- Introduce editing dvc.yaml
- Mention dvc stage add --force?
Run the pipeline
- Add another stage
- Introduce dvc repro
- Update an intermediate stage's dependency
- Rerun the pipeline
Visualize the pipeline
- List the stages
- Show the DAG
Removing Stages
- Introduce dvc remove
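For reference, here is a minimal sketch of the dvc.yaml that the outline above would build up. It assumes the example-get-started project layout: the file paths, script names, and parameter names are borrowed from that project and are illustrative only, and the final tutorial text may differ. The comments show the dvc stage add invocations that could generate each stage.

```yaml
# Sketch only: paths and parameter names assume the example-get-started project.
#
# The first stage could be created with:
#   dvc stage add -n prepare \
#     -p prepare.seed,prepare.split \
#     -d src/prepare.py -d data/data.xml \
#     -o data/prepared \
#     python src/prepare.py data/data.xml
#
# The second stage could be created with:
#   dvc stage add -n featurize \
#     -p featurize.max_features,featurize.ngrams \
#     -d src/featurization.py -d data/prepared \
#     -o data/features \
#     python src/featurization.py data/prepared data/features
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:            # read from params.yaml by default
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared  # output of prepare, so featurize runs after it
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
```

With a file like this in place, dvc repro runs the stages in dependency order and skips any stage whose dependencies have not changed, dvc stage list lists the stages, and dvc dag prints the graph.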
@shcheklein @jorgeorpinel @dberenbaum
I think we could probably skip "Removing stages," especially if we introduce editing dvc.yaml.
Agreed with Dave. Overall, Get Started should not be a comprehensive overview. It should be a quick happy path that presents the most important functionality and the value as fast as possible. Everything else is secondary to that.
In this case it would be nice to start with dvc stage add, explain dvc.yaml, and then almost immediately (I would not even add subtitles for now) move on to dvc repro or dvc exp run (exp run is probably even better). Then mention that pipelines can be more advanced (templates), and show the pipeline.
That's pretty much it to be honest. Do we need two subsections for this - I don't know.
Ideally we would rely on one of the existing projects. Maybe the example-get-started one since it makes at least some sense to use pipelines there.
> Ideally we would rely on one of the existing projects. Maybe the example-get-started one since it makes at least some sense to use pipelines there.
I can use example-get-started for this, but example-dvc-experiments also has a 2 stage pipeline, starting from extract (un-tar) and training with train.py. This one is simpler. example-get-started is a bit more complex.
> also has a 2 stage pipeline, starting from extract (un-tar)
This is an ugly, unfortunate hack that we need to remove eventually :) It's very sad that we have it in the project now. It's not sustainable and not how DVC should be used.
The fact that we had to hack it may be a bit ugly, but explaining pipelines without resorting to Python or other code seems like a viable alternative to me. The user may have some difficulty bridging the gap between ordinary commands and an ML project, but the basic mechanism could be explained more simply that way.
Anyway, no strong opinions here, I'll proceed with example-get-started.
We can probably repurpose the relevant info here for https://github.com/iterative/dvc.org/issues/2883 instead (i.e. close this issue) and leave https://dvc.org/doc/start/data-pipelines as-is. Or are there still major issues with that page @iesahin ?
Guys, do we still want a separate pipelines trail? Pipelining info is inside https://dvc.org/doc/start/data-management right now. I would personally like to see a separate one, but I remember there were opinions against that. I would put Experiments first, then Data Management, then Pipelines. WDYT @iesahin @dberenbaum @shcheklein ? Thanks