Discussion: External Config Design
As noted in the PR (#69) containing the initial version of external configuration support, there are some open design questions regarding the granularity at which a pipeline author should specify configurable task arguments. This issue is to record those thoughts and serve as a place for further discussion.
This breaks down into two related questions:
- Should flow declarations specify the source for their configuration (e.g. FromEnv/FromFile), or should the user specify this at a higher level, prior to calling runFlow?
- Should runFlow handle parsing config values from external sources (e.g. reading config files), or should this be done by the user prior to calling runFlow?
These questions could be addressed in a couple of different ways which would produce the following interfaces:
Case 1 (implementation in tweag/funflow2#69)
flow = dockerFlow DockerTaskConfig {args=[FromEnv "FOO"]}
runFlow flow
Pros:
- Simpler to invoke pipeline - runFlow automatically checks the environment for $FOO
- Easier to reason about - config for a task can only come from one place
Cons:
- If you build a library of specific tasks, users will have to pass in the config the way you specify
  - For organizations, this may not be much of an issue - e.g. we specify password inputs as environment variables and all employees must adhere to this.
  - For flows shared publicly this might be more annoying since people have different preferences. How often will flows themselves be shared outside of an organization?
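A minimal sketch of what Case 1 implies, assuming heavily simplified stand-ins for the real funflow types. `Arg`, `Literal`, `resolveArg`, and this toy `runFlow` are hypothetical and exist only for illustration. The point is that the flow declaration fixes the source, so the runner has to perform the environment lookup itself:

```haskell
import System.Environment (lookupEnv)

-- Hypothetical, simplified stand-ins for the real funflow types.
data Arg
  = Literal String  -- fixed value written into the flow
  | FromEnv String  -- value read from the named environment variable
  deriving (Show, Eq)

newtype DockerTaskConfig = DockerTaskConfig { args :: [Arg] }

-- Resolve one argument. The source is dictated by the flow declaration,
-- so the caller has no way to supply the value differently.
resolveArg :: Arg -> IO (Either String String)
resolveArg (Literal s) = pure (Right s)
resolveArg (FromEnv name) =
  maybe (Left ("missing environment variable: " ++ name)) Right
    <$> lookupEnv name

-- Toy runFlow: resolves every argument and reports the first missing one.
runFlow :: DockerTaskConfig -> IO (Either String [String])
runFlow cfg = sequence <$> traverse resolveArg (args cfg)
```

With this shape, testing a flow that uses `FromEnv` means setting real environment variables, which is part of what Cases 2 and 3 try to avoid.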
Case 2.
Move config loading out one level
flow = dockerFlow DockerTaskConfig {args=[FromEnv "FOO"]}
config = readConfig $ getFlowConfigKeys flow
runFlow flow config
Pros:
- Same as 1
- Also separates out the logic for reading config from the environment and allows one to pass in specific values for testing
- A getFlowConfigKeys function is already needed to automatically generate a CLI anyway
Cons:
- Same as 1
- Invoking the pipeline takes an extra step
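A rough sketch of how Case 2 could separate the steps, assuming a toy model in which a flow is just a list of tasks. `Task`, `Flow`, `Config`, and the bodies of `getFlowConfigKeys`, `readConfig`, and `runFlow` are all hypothetical here, and `Data.Map.Strict` is used only to keep the example dependency-light:

```haskell
import qualified Data.Map.Strict as Map
import System.Environment (lookupEnv)

-- Hypothetical, simplified stand-ins for the real funflow types.
data Arg = Literal String | FromEnv String deriving (Show, Eq)
newtype Task = Task { taskArgs :: [Arg] }
type Flow = [Task]
type Config = Map.Map String String

-- Collect every key the flow expects to find in the environment.
getFlowConfigKeys :: Flow -> [String]
getFlowConfigKeys flow = [k | Task xs <- flow, FromEnv k <- xs]

-- Read the requested keys from the environment; unset keys are skipped.
readConfig :: [String] -> IO Config
readConfig keys = do
  values <- traverse lookupEnv keys
  pure (Map.fromList [(k, v) | (k, Just v) <- zip keys values])

-- runFlow is now pure with respect to config: it only consults the map.
runFlow :: Flow -> Config -> Either String [[String]]
runFlow flow config = traverse (traverse resolve . taskArgs) flow
  where
    resolve (Literal s) = Right s
    resolve (FromEnv k) =
      maybe (Left ("missing config key: " ++ k)) Right (Map.lookup k config)
```

In production one would call `readConfig (getFlowConfigKeys flow)` and pass the result on, while a test can hand `runFlow` a hand-built map and never touch the environment.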
Case 3.
Move specification of config sources out of flow declarations and have the user provide configs:
flow = dockerFlow DockerTaskConfig {args=[Configurable "FOO"]}
config = HashMap.fromList [("FOO", fromEnv "FOO")]
runFlow flow config
Pros:
- Allows flows to be shared more flexibly since it abstracts the source of a config out of the task and leaves it to the user to handle getting the config.
- Might be simpler to test since you could pass in values to runFlow without having it read from external sources
Cons:
- Requires extra work on the part of the user - they have to be sure to pass in the required config when calling runFlow
- If configurable args in a flow are a more general type, it becomes more difficult to automatically construct a CLI, unless we make it so that any configurable argument can be configured by the CLI.
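A small sketch of Case 3, under the assumption that each config entry is an action producing an optional value. `Config`, `fromValue`, and this toy `runFlow` are hypothetical, and `Data.Map.Strict` stands in for the `HashMap` used above:

```haskell
import qualified Data.Map.Strict as Map
import System.Environment (lookupEnv)

-- Hypothetical, simplified stand-ins for the real funflow types.
data Arg = Literal String | Configurable String deriving (Show, Eq)

-- Each key maps to a user-chosen source: an action that may yield a value.
type Config = Map.Map String (IO (Maybe String))

-- Source that reads an environment variable.
fromEnv :: String -> IO (Maybe String)
fromEnv = lookupEnv

-- Source that returns a fixed value, handy for tests.
fromValue :: String -> IO (Maybe String)
fromValue = pure . Just

-- Toy runFlow: the flow only names keys; the caller decides the sources.
runFlow :: [Arg] -> Config -> IO (Either String [String])
runFlow flowArgs config = sequence <$> traverse resolve flowArgs
  where
    resolve (Literal s) = pure (Right s)
    resolve (Configurable k) = case Map.lookup k config of
      Nothing -> pure (Left ("no source configured for key: " ++ k))
      Just source ->
        maybe (Left ("source produced no value for key: " ++ k)) Right
          <$> source
```

The same flow can then be run with `Map.fromList [("FOO", fromEnv "FOO")]` in production and `Map.fromList [("FOO", fromValue "test")]` in a test, at the cost of the caller having to supply every key.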
Thanks a lot for this clarifying post :)
I don't really understand Case 2. Could you please elaborate on the line:
config = readConfig $ getFlowConfigKeys flow
One could also consider
Case 4.
runFlow tries to load all configuration from every source, in order:
- CLI
- config file
- environment
flow = dockerFlow DockerTaskConfig {args=[Configurable "foo.bar"]}
runFlow flow input ()
and this foo.bar value can be passed:
- from CLI
  myexecutable --foo-bar someValue
- from file, e.g. YAML
  foo: { bar: someValue }
- from environment
  FOO_BAR="someValue" myexecutable
My guess is that the config values will always come as string/text (no matter if it is from a file, from CLI or env).
Parsing to other types (Int, Float, etc.) could take place in the specific interpreters where needed.
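To make the Case 4 naming and lookup order concrete, here is a sketch under a few assumptions: the three sources have already been read into maps, the config file has been flattened to dotted keys, and an earlier source wins over a later one (CLI, then file, then environment). `Sources`, `cliFlag`, `envVar`, `resolveKey`, and `resolveInt` are hypothetical names:

```haskell
import Control.Applicative ((<|>))
import Data.Char (toUpper)
import qualified Data.Map.Strict as Map
import Text.Read (readMaybe)

-- "foo.bar" -> "--foo-bar"
cliFlag :: String -> String
cliFlag key = "--" ++ map (\c -> if c == '.' then '-' else c) key

-- "foo.bar" -> "FOO_BAR"
envVar :: String -> String
envVar = map (\c -> if c == '.' then '_' else toUpper c)

-- Already-loaded raw values; everything arrives as a String.
data Sources = Sources
  { cliValues  :: Map.Map String String  -- keyed by flag, e.g. "--foo-bar"
  , fileValues :: Map.Map String String  -- keyed by dotted path, e.g. "foo.bar"
  , envValues  :: Map.Map String String  -- keyed by variable, e.g. "FOO_BAR"
  }

-- Try CLI first, then the config file, then the environment.
resolveKey :: Sources -> String -> Maybe String
resolveKey s key =
      Map.lookup (cliFlag key) (cliValues s)
  <|> Map.lookup key (fileValues s)
  <|> Map.lookup (envVar key) (envValues s)

-- An interpreter that needs an Int parses the raw string itself.
resolveInt :: Sources -> String -> Maybe Int
resolveInt s key = resolveKey s key >>= readMaybe
```

Keeping the raw values as strings means the lookup code stays source-agnostic, and only the interpreter that needs a typed value has to deal with parse failures.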
I don't really understand Case 2. Could you please elaborate on the line:
Ya, so one thing I realized is that to generate CLI flags using the configurables in a flow, we probably need to provide a function, getFlowConfigKeys or something similar, which can gather required config keys from the various tasks in a flow (e.g. prior to weaving it) so that we know which arguments the CLI will accept.
Since there might be additional top-level CLI arguments that we want to add besides those for the configurable values, it probably makes the most sense to do this outside of runFlow. That way the user could do something like the following (e.g. with optparse-applicative):
import Options.Applicative

runFlowWithCLI :: Flow a b -> a -> IO b
runFlowWithCLI flow input = do
  let -- Create the main CLI options, maybe with other sub-commands, etc.
      topLevelCLI = (...) :: Parser
      -- Create the flow-specific configuration options
      flowCLIOpts = flowCLI flow
  -- Run the command line parser (execParser expects a ParserInfo, hence `info`)
  cliOpts <- execParser (info (topLevelCLI <*> flowCLIOpts) fullDesc)
  -- Pass in the parsed options to runFlowWithConfig. We could add
  -- a field to `FlowConfig` for explicitly passing in config values.
  -- (Passing `input` as the last argument is an assumption of this sketch.)
  runFlowWithConfig flow (configFromOpts cliOpts) input
  -- or, we can also handle reading env variables and the config file here prior
  -- to calling runFlowWithConfig.
  where
    -- | Traverses a flow and builds a set of CLI options using the flow's
    -- configurable fields. This could call the `getFlowConfigKeys` function I mentioned.
    flowCLI :: Flow a b -> Parser CLIOpts
    -- | Converts a parsed CLIOpts to a FlowConfig
    configFromOpts :: CLIOpts -> FlowConfig
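For the option-building part specifically, a self-contained sketch with optparse-applicative might look like the following. It assumes the config keys have already been gathered into a plain list (e.g. by something like getFlowConfigKeys), and `flowCLI` here takes that list rather than a flow. `topLevelCLI` with its `--verbose` switch and the `parseArgs` helper are hypothetical:

```haskell
import Options.Applicative

-- One required string option per config key, e.g. key "FOO" gives "--FOO VALUE".
flowCLI :: [String] -> Parser [(String, String)]
flowCLI = traverse keyOption
  where
    keyOption key =
      (,) key <$> strOption
        (long key <> metavar "VALUE" <> help ("Value for config key " ++ key))

-- Hypothetical top-level option that is unrelated to the flow's config.
topLevelCLI :: Parser Bool
topLevelCLI = switch (long "verbose" <> help "Enable verbose output")

-- Pure entry point so the combined parser can be exercised without real argv.
parseArgs :: [String] -> [String] -> Maybe (Bool, [(String, String)])
parseArgs keys argv =
  getParseResult (execParserPure defaultPrefs parserInfo argv)
  where
    parserInfo = info ((,) <$> topLevelCLI <*> flowCLI keys) fullDesc
```

Because the flow-specific parser is an ordinary `Parser` value, it composes with whatever top-level options the user defines, which is the main argument for building it outside of runFlow.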