funflow Discussion: External Config Design

As noted in the PR (#69) containing the initial version of external configuration support, there are some open design questions regarding the granularity at which a pipeline author should specify configurable task arguments. This issue is to record those thoughts and serve as a place for further discussion.

This breaks down into two related questions:

Should flow declarations specify the source for their configuration (e.g. FromEnv/FromFile) or should the user specify this at a higher level, prior to calling runFlow?
Should runFlow handle parsing config values from external sources (e.g. reading config files), or should this be done by the user prior to calling runFlow?

These questions could be addressed in a few different ways, each of which would produce a different interface:

Case 1 (implementation in tweag/funflow2#69)

flow = dockerFlow DockerTaskConfig {args=[FromEnv "FOO"]}
runFlow flow

Pros:

Simpler to invoke pipeline - runFlow automatically checks the environment for $FOO
Easier to reason about - config for a task can only come from one place

Cons:

If you build a library of specific tasks, users will have to pass in the config the way you specify
- For organizations, this may not be much of an issue - e.g. we specify password inputs as environment variables and all employees must adhere to this.
- For flows shared publicly this might be more annoying since people have different preferences. How often will flows themselves be shared outside of an organization?
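
To make Case 1 concrete, here is a minimal sketch of what resolving `FromEnv` arguments inside `runFlow` might look like. The `Arg` type, `resolveArgWith` and `resolveArgs` are hypothetical stand-ins and not the actual funflow API; the lookup action is passed in as a parameter only so the logic can be exercised without touching the real environment.

```haskell
import System.Environment (lookupEnv)

-- Hypothetical stand-ins for the funflow types discussed above.
data Arg
  = Literal String -- a fixed value written directly in the flow
  | FromEnv String -- a value read from the named environment variable
  deriving (Show, Eq)

newtype DockerTaskConfig = DockerTaskConfig {args :: [Arg]}

-- Resolve one argument with the given lookup action.
-- A Left carries the name of the variable that was not set.
resolveArgWith ::
  Monad m => (String -> m (Maybe String)) -> Arg -> m (Either String String)
resolveArgWith _ (Literal s) = pure (Right s)
resolveArgWith look (FromEnv name) = maybe (Left name) Right <$> look name

-- Roughly what a Case 1 style runFlow would have to do internally before
-- running the task: read every FromEnv argument from the process
-- environment and stop at the first missing variable.
resolveArgs :: DockerTaskConfig -> IO (Either String [String])
resolveArgs cfg = sequence <$> mapM (resolveArgWith lookupEnv) (args cfg)
```

Since the source of each value is fixed in the flow declaration, the caller has nothing to pass in, which is the main convenience (and the main rigidity) of this case.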

Case 2.

Move config loading out one level

flow = dockerFlow DockerTaskConfig {args=[FromEnv "FOO"]}
config = readConfig $ getFlowConfigKeys flow
runFlow flow config

Pros:

Same as Case 1
Also separates out the logic for reading config from the environment and allows one to pass in specific values for testing
We already need a getFlowConfigKeys function anyway in order to automatically generate a CLI

Cons:

Same as 1
Invoking the pipeline takes an extra step
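
A rough sketch of the Case 2 split, assuming a flow can be modelled as a plain list of arguments (a simplification; real flows are not lists). `FlowConfig` as a `Map`, `readConfig` and `resolveArgs` are hypothetical names used for illustration only. The point is that the only `IO` step is `readConfig`, so tests can build a `FlowConfig` by hand.

```haskell
import qualified Data.Map.Strict as Map
import Data.Maybe (catMaybes)
import System.Environment (lookupEnv)

-- Hypothetical, simplified argument type.
data Arg = Literal String | FromEnv String

-- Hypothetical config representation: resolved key/value pairs.
type FlowConfig = Map.Map String String

-- Collect the keys a flow needs (here a flow is just a list of args).
getFlowConfigKeys :: [Arg] -> [String]
getFlowConfigKeys as = [k | FromEnv k <- as]

-- The only effectful step: read each key from the environment.
-- Keys that are not set are simply left out of the result.
readConfig :: [String] -> IO FlowConfig
readConfig keys = do
  found <- mapM (\k -> fmap ((,) k) <$> lookupEnv k) keys
  pure (Map.fromList (catMaybes found))

-- Pure resolution against an explicit config, which is what makes it
-- possible to pass in specific values for testing.
-- A Left carries the first key that has no value.
resolveArgs :: FlowConfig -> [Arg] -> Either String [String]
resolveArgs cfg = traverse resolve
  where
    resolve (Literal s) = Right s
    resolve (FromEnv k) = maybe (Left k) Right (Map.lookup k cfg)
```

In a test one could call `resolveArgs (Map.fromList [("FOO", "x")]) flowArgs` directly and never read the environment.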

Case 3.

Move specification of config sources out of flow declarations and have the user provide configs:

flow = dockerFlow DockerTaskConfig {args=[Configurable "FOO"]}
config = HashMap.fromList [("FOO", fromEnv "FOO")]
runFlow flow config

Pros:

Allows flows to be shared more flexibly since it abstracts the source of a config out of the task and leaves it to the user to handle getting the config.
Might be simpler to test since you could pass in values to runFlow without having it read from external sources

Cons:

Requires extra work on the part of the user - they have to be sure to pass in the required config when calling runFlow
If configurable args in a flow are of a more general type, it becomes more difficult to automatically construct a CLI, unless we make it so that any configurable argument can be configured by the CLI.
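
One way to soften the first con of Case 3 would be to check the user-supplied config against the flow's keys before running anything. The sketch below assumes a hypothetical `Source` type and models a flow as a list of arguments; none of these names come from funflow itself.

```haskell
import qualified Data.Map.Strict as Map
import System.Environment (lookupEnv)

-- Hypothetical, simplified argument type: the flow only names a key.
data Arg = Literal String | Configurable String

-- Hypothetical description of where the user wants a value to come from.
data Source
  = FromEnv String -- read the named environment variable
  | Fixed String -- use this value as-is (handy in tests)

type UserConfig = Map.Map String Source

-- Keys that the flow needs but the user did not provide a source for.
-- An empty result means the config covers the flow.
missingKeys :: UserConfig -> [Arg] -> [String]
missingKeys cfg as = [k | Configurable k <- as, Map.notMember k cfg]

-- Load a single value from its source.
loadSource :: Source -> IO (Maybe String)
loadSource (Fixed v) = pure (Just v)
loadSource (FromEnv name) = lookupEnv name

-- Load every configured value; Nothing marks a source that had no value.
loadConfig :: UserConfig -> IO (Map.Map String (Maybe String))
loadConfig = traverse loadSource
```

With this shape, `runFlow` could fail early with the list from `missingKeys` instead of failing part-way through the pipeline.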

Oct 28 '20 10:10 dorranh

Thanks a lot for this clarifying post :)

I don't really understand Case 2. Could you please elaborate on the line:

config = readConfig $ getFlowConfigKeys flow

One could also consider

Case 4.

runFlow tries to load all configuration from all sources, in order:

CLI
config file
environment

flow = dockerFlow DockerTaskConfig {args=[Configurable "foo.bar"]}

runFlow flow input ()

and this foo.bar value can be passed:

from the CLI: myexecutable --foo-bar someValue
from a file, e.g. YAML: foo: { bar: someValue }
from the environment: FOO_BAR="someValue" myexecutable

My guess is that the config values will always come as string/text (no matter if it is from a file, from CLI or env).

Parsing to other types (Int, Float, etc.) could take place in specific interpreters where needed.
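
As a rough illustration of Case 4, the sketch below derives the CLI flag and environment variable name from a dotted key and looks the key up in each source in turn. It assumes that the order listed above is a precedence order (CLI first, then file, then environment), that all three sources have already been loaded into string key/value lists, and that the config file has been flattened to dotted keys. `Sources`, `cliFlag`, `envName`, `lookupConfig` and `lookupInt` are hypothetical names.

```haskell
import Control.Applicative ((<|>))
import Data.Char (toUpper)
import Text.Read (readMaybe)

-- "foo.bar" becomes "--foo-bar"
cliFlag :: String -> String
cliFlag key = "--" ++ map (\c -> if c == '.' then '-' else c) key

-- "foo.bar" becomes "FOO_BAR"
envName :: String -> String
envName = map (\c -> if c == '.' then '_' else toUpper c)

-- Values already loaded from each source, all as strings.
data Sources = Sources
  { cliValues :: [(String, String)] -- keyed by flag, e.g. "--foo-bar"
  , fileValues :: [(String, String)] -- keyed by dotted path, e.g. "foo.bar"
  , envValues :: [(String, String)] -- keyed by variable name, e.g. "FOO_BAR"
  }

-- Take the first source that has a value, in the assumed precedence order.
lookupConfig :: Sources -> String -> Maybe String
lookupConfig s key =
  lookup (cliFlag key) (cliValues s)
    <|> lookup key (fileValues s)
    <|> lookup (envName key) (envValues s)

-- Parsing to a specific type happens on the consumer side, as suggested
-- above; Nothing covers both a missing key and an unparsable value.
lookupInt :: Sources -> String -> Maybe Int
lookupInt s key = lookupConfig s key >>= readMaybe
```

Keeping every value as a string until an interpreter asks for it means the loader does not need to know the types of the configurable arguments.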

Oct 28 '20 13:10 GuillaumeDesforges

> I don't really understand Case 2. Could you please elaborate on the line:

Ya, so one thing I realized is that to generate CLI flags using the configurables in a flow, we probably need to provide a function, getFlowConfigKeys or something similar, which can gather required config keys from the various tasks in a flow (e.g. prior to weaving it) so that we know which arguments the CLI will accept.

Since there might be additional top-level CLI arguments that we want to add besides those for the configurable values, it probably makes the most sense to do this outside of runFlow. That way the user could do something like the following (e.g. with optparse-applicative):

import Options.Applicative

runFlowWithCLI :: Flow a b -> a -> IO b
runFlowWithCLI flow input = do
  let -- Create the main CLI options, maybe with other sub-commands, etc.
      topLevelCLI = (...) :: Parser
      -- Create the flow-specific configuration options
      flowCLIOpts = flowCLI flow

  -- Run the command line parser (execParser expects a ParserInfo)
  cliOpts <- execParser (info (topLevelCLI <*> flowCLIOpts) fullDesc)

  -- Pass in the parsed options to runFlowWithConfig. We could add
  -- a field to `FlowConfig` for explicitly passing in config values.
  -- (Passing `input` as the last argument is an assumption.)
  runFlowWithConfig flow (configFromOpts cliOpts) input

  -- or, we can also handle reading env variables and the config file here prior
  -- to calling runFlowWithConfig.
  where
    -- | Traverses a flow and builds a set of CLI options using the flow's
    -- configurable fields. This could call the `getFlowConfigKeys` function I mentioned.
    flowCLI :: Flow a b -> Parser CLIOpts
    -- | Converts a parsed CLIOpts to a FlowConfig
    configFromOpts :: CLIOpts -> FlowConfig
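
For illustration, here is a dependency-light sketch of the two pieces mentioned above: gathering keys from the tasks in a flow, and accepting only those keys on the command line. It uses a hypothetical first-order `FlowDesc` tree rather than real funflow flows, and a hand-rolled `parseFlags` in place of optparse-applicative so that it only needs `base` and `containers`.

```haskell
import qualified Data.Set as Set

-- Hypothetical, simplified argument type.
data Arg = Literal String | Configurable String

-- Hypothetical first-order description of a flow: a task with its
-- arguments, or two flows run one after the other.
data FlowDesc
  = Task [Arg]
  | Then FlowDesc FlowDesc

-- Gather the configurable keys from every task in the flow.
-- Using a Set removes keys that several tasks share.
getFlowConfigKeys :: FlowDesc -> Set.Set String
getFlowConfigKeys (Task as) = Set.fromList [k | Configurable k <- as]
getFlowConfigKeys (Then f g) =
  getFlowConfigKeys f `Set.union` getFlowConfigKeys g

-- Parse "--key value" pairs, accepting only keys the flow declares.
-- Unknown options, stray arguments, and a trailing flag without a value
-- are all rejected with a Left.
parseFlags :: Set.Set String -> [String] -> Either String [(String, String)]
parseFlags _ [] = Right []
parseFlags keys (('-' : '-' : k) : v : rest)
  | k `Set.member` keys = ((k, v) :) <$> parseFlags keys rest
  | otherwise = Left ("unknown option: --" ++ k)
parseFlags _ (x : _) = Left ("unexpected argument: " ++ x)
```

A real implementation would more likely build an optparse-applicative `Parser` from the same key set, so that help text and top-level options compose as described above.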

Oct 29 '20 08:10 dorranh