start: Pipelines Trail

Open iesahin opened this issue 4 years ago • 7 comments

This is the third step in the Get Started (GS) restructuring we discussed in #2496 (which may be closed by addressing this one).

See https://github.com/iterative/dvc.org/issues/2496#issuecomment-847598646, https://github.com/iterative/dvc.org/issues/2496#issuecomment-857021652, https://github.com/iterative/dvc.org/issues/2496#issuecomment-857080772

It will introduce creating pipelines, adding stages and running them with dvc repro.

Create a pipeline

  • Why do we use pipelines in DVC?
  • What are dependencies?

Add a stage

  • Introduce dvc stage add

Edit a stage

  • Introduce editing dvc.yaml
  • Mention dvc stage add --force?

Run the pipeline

  • Add another stage
  • Introduce dvc repro
  • Update an intermediate stage's dependency
  • Rerun the pipeline

Visualize the pipeline

  • List the stages
  • Show the DAG

Removing Stages

  • Introduce dvc remove
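
For illustration, the stages this outline walks through might end up in a dvc.yaml roughly like the sketch below. The stage names, scripts, and paths are assumptions loosely modeled on example-get-started, not a decided design. The comments note which command from the outline relates to each part.

```yaml
# Sketch only: stage names, scripts, and paths are hypothetical.
# A stage like this could be created with:
#   dvc stage add -n prepare \
#     -d src/prepare.py -d data/data.xml \
#     -o data/prepared \
#     python src/prepare.py data/data.xml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - src/prepare.py
      - data/data.xml
    outs:
      - data/prepared
  # A second stage that depends on the first stage's output. Editing
  # this entry by hand is the alternative to `dvc stage add --force`.
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - src/featurization.py
      - data/prepared
    outs:
      - data/features
```

With a file like this in place, dvc repro runs both stages in dependency order and, on later runs, skips stages whose dependencies have not changed. dvc stage list and dvc dag cover the "Visualize the pipeline" items.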

@shcheklein @jorgeorpinel @dberenbaum

iesahin avatar Sep 27 '21 11:09 iesahin

I think we could probably skip "Removing stages," especially if we introduce editing dvc.yaml.

dberenbaum avatar Sep 27 '21 19:09 dberenbaum

Agreed with Dave. Overall, Get Started should not be a comprehensive overview. It should be a quick happy path that presents the most important functionality and the value as fast as possible. Everything else is secondary to that.

In this case it would be nice to start with dvc stage add, explain dvc.yaml, and then almost immediately (I would not even use subtitles for now) move on to dvc repro or dvc exp run (exp run is probably even better). Then mention that pipelines can be more advanced (templates), and show the pipeline.

That's pretty much it, to be honest. Do we need two subsections for this? I don't know.

Ideally we would rely on one of the existing projects. Maybe the example-get-started one since it makes at least some sense to use pipelines there.

shcheklein avatar Oct 12 '21 21:10 shcheklein

> Ideally we would rely on one of the existing projects. Maybe the example-get-started one since it makes at least some sense to use pipelines there.

I can use example-get-started for this, but example-dvc-experiments also has a two-stage pipeline: an extract (un-tar) stage followed by training with train.py. That one is simpler; example-get-started is a bit more complex.

iesahin avatar Oct 13 '21 11:10 iesahin

> also has a 2 stage pipeline, starting from extract (un-tar)

This is an ugly, unfortunate hack that we need to remove eventually :) It's very sad that we have it in the project now. It's not sustainable and not how DVC should be used.

shcheklein avatar Oct 14 '21 01:10 shcheklein

The fact that we had to resort to a hack may be a bit ugly, but explaining pipelines without resorting to Python or other code seems like a viable alternative to me. Users may have some difficulty bridging the gap between ordinary commands and an ML project, but the basic mechanism could be explained more simply that way.

Anyway, no strong opinions here, I'll proceed with example-get-started.

iesahin avatar Oct 18 '21 05:10 iesahin

We can probably repurpose the relevant info here for https://github.com/iterative/dvc.org/issues/2883 instead (i.e. close this issue) and leave https://dvc.org/doc/start/data-pipelines as-is. Or are there still major issues with that page @iesahin ?

jorgeorpinel avatar Mar 30 '22 08:03 jorgeorpinel

Guys, do we still want a separate pipelines trail? Pipelining info is inside https://dvc.org/doc/start/data-management right now. I would personally like to see a separate one, but I remember there were opinions against that. I would put Experiments first, then Data Management, then Pipelines. WDYT @iesahin @dberenbaum @shcheklein ? Thanks

jorgeorpinel avatar Jun 20 '22 23:06 jorgeorpinel