Add tips on structuring your pipeline
In particular:
- pull repetitive steps out into their own actions
- restructure expensive, long-running loops so that each iteration can run as an individual action with different parameters
The former saves you time because dependent actions can reuse previously computed outputs.
The latter gives you easy parallelisation of your loops.
By extracting analytical choices out of an R or Stata script and into the pipeline, we're meddling with users' preferences for how they write and organise code. Not necessarily in a bad way, but it's something to be aware of.
With loops, for example, it might be annoying if the parameters you want to iterate over are derived from the analysis itself.
In general, it would be useful to know more about how to "program" with YAML (or at least make it feel that way, e.g. by passing vectors / lists output by a script into a single action). Eventually this info will live in the documentation, but I need to understand more about it first! This is related to questions @angelwong121 has had previously.
Yes.
1. We should explain it so that they can make the choices. There are benefits to breaking out in loops but also drawbacks.
2. Eventually our YAML config format will grow the ability to repeat things with a range of parameters but not yet so we need an upgrade path
3. Generating YAML from code is one such strategy e.g. this script which only generates part of the YAML for copy-and-pasting
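To make the "generate YAML from code" idea concrete, here is a rough sketch of what such a script could look like. It is not the script mentioned above. The function name `render_actions`, the action names, the script path, the `r:latest` run command, the `generate_cohort` dependency and the `moderately_sensitive` output level are all assumptions for illustration, so adjust them to match your own `project.yaml`.

```python
def render_actions(script, params, needs):
    """Return YAML text defining one action per parameter value.

    All names here are hypothetical; the layout assumes actions sit
    two spaces deep under a top-level `actions:` key.
    """
    blocks = []
    for p in params:
        name = f"model_{p}"  # hypothetical naming scheme: one action per parameter
        blocks.append(
            f"  {name}:\n"
            f"    run: r:latest {script} {p}\n"  # parameter passed as a script argument
            f"    needs: [{', '.join(needs)}]\n"
            f"    outputs:\n"
            f"      moderately_sensitive:\n"
            f"        estimates: output/{name}.csv\n"
        )
    # Blank line between actions to keep the pasted YAML readable
    return "\n".join(blocks)


if __name__ == "__main__":
    # Prints a fragment to copy and paste under `actions:` in project.yaml
    print(render_actions("analysis/model.R", ["age", "sex", "region"], ["generate_cohort"]))
```

This only covers parameters known up front. If the values are derived from the analysis itself, you would have to regenerate and re-paste the fragment once they are known.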
Cool. Of course, by "meddling" I really meant "providing a larger menu of analytical options".
Option 3 is a simple approach that's easy to implement and document.
Just to note there's a ticket about this: https://github.com/opensafely/job-runner/issues/28