Feature Enhancement: Using data from the environment with validation
Imagine I have files in several folders, where each folder is named after a US state (tennessee, virginia, massachusetts), and each folder contains a YAML file with the following contents:
state_name: "<state name>"
state_url: "https://<state name>.example.com"
I'd essentially like to be able to call Yamale with an environment variable set (STATE=tennessee yamale state.yaml) and have that value be the only one allowed. Similarly, if I could do some form of string concatenation, so that I could confirm the environment variable's value appears as part of my value, that would be excellent. Any thoughts on this enhancement request? A possible example of the resulting schema.yaml is below.
state_name: enum(env(STATE))
state_url: enum("https://" + env(STATE) + ".example.com")
I'm a little hesitant to add support for environment variables in the schema. We'd like to keep the schema as static as possible. Feel free to create your own validator, there's a small example here: https://github.com/23andMe/Yamale#custom-validators
And to support concatenation, you can override the __add__ method on the validator class. Let me know if you need any help, and thanks for your interest in Yamale.
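To make the suggestion above concrete, here is a minimal sketch of a custom validator that only accepts the current value of the STATE environment variable. In a real project the base class would be imported with `from yamale.validators import Validator` and registered via `yamale.make_schema(..., validators=...)` as the linked README example shows; the tiny stand-in base class here just keeps the sketch self-contained, and the `env_state` tag name is an invented placeholder.

```python
import os

# Stand-in for yamale.validators.Validator, so the sketch runs on its own.
# Real code would subclass the Yamale base class instead.
class Validator:
    def __init__(self, *args, **kwargs):
        pass

class EnvState(Validator):
    """Accepts only the current value of the STATE environment variable."""
    tag = 'env_state'

    def _is_valid(self, value):
        return value == os.environ.get('STATE')

# Quick demonstration:
os.environ['STATE'] = 'tennessee'
v = EnvState()
print(v._is_valid('tennessee'))  # True
print(v._is_valid('virginia'))   # False
```

The concatenation case (`state_url`) could then be handled inside `_is_valid` by comparing against `f"https://{os.environ.get('STATE')}.example.com"` rather than overloading `__add__`, though either approach should work.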
Is there an easy way to add a custom validator that I can then use directly from the command line? It seems like no, but if that's the case, then it feels like the CLI is limited to "base" scenarios only - is that the case?
That's correct, there is no current way to load custom validators from the command line. We would be open to having that feature, and it has been brought up before: https://github.com/23andMe/Yamale/issues/57
If you would like to implement this feature, we do accept pull requests. Please submit a small outline of how the user would use this feature before developing to ensure we're on the same page.
Good to know. I'll definitely consider a PR. As for outline... My initial thought would be something like:
Using the literal block style of YAML, allow "embedding" Python code directly in the schema file. This would give the ability to keep the code and schema close together (though it would prevent re-use of the code unless some kind of external include functionality were added to the schema definition itself). You could then have something like:
custom_validator:
  type: validator
  code: |
    """
    Full contents of the CustomValidator(Validator) class goes here,
    where the class would be generated based on the name of the element (custom_validator)
    and exec()'d to actually import the class.
    Immediate checks could be done to determine if the element or the tag overloads any built-ins.
    """
    tag = 'not_bool'
    def _is_valid(self, value):
        return not isinstance(value, bool)
I'm completely open to suggestions and feedback. This is only a very quick idea; I haven't actually fleshed it out beyond verifying that it's possible to dynamically create a class within Python.
Note: custom_validator and CustomValidator here are just placeholders; the person creating the schema.yaml would be able to provide whatever value they wanted so as to not stomp on actual expected yaml keys
That's interesting. We recently patched a hole that allowed arbitrary code to be executed via schema loading and need to be careful around this area: https://github.com/23andMe/Yamale/pull/165
If we were to take this approach, I'd at least want a flag on the command line like --load-validators-from-schema to enable loading code from the schema file.
Another way would be to create a Python package that is installed via pip in the same environment as Yamale. The list of modules to load could be part of the schema or specified on the command line. Again, in this case I'd like a command-line parameter to enable that feature, so the user explicitly knows that arbitrary code can be loaded while processing the schema. The optional validator libraries could be published to PyPI and required by your process, to avoid loading from random files in some path on your machine.
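One possible shape for that pip-package approach: the CLI takes a module name (e.g. a hypothetical `--validators my_team_validators` flag), imports it from the environment, and asks it for its validators. The flag, the module name, and the `get_validators()` convention are all assumptions for illustration, not an existing Yamale interface; the demo below substitutes an in-memory module for a real installed package.

```python
import importlib
import sys
import types

def load_validator_module(module_name):
    """Import an installed module and return its declared validators.

    Assumed convention: the module exposes get_validators() returning a
    dict of {tag: validator_class} to merge into Yamale's defaults.
    """
    module = importlib.import_module(module_name)
    return module.get_validators()

# Demo with a stand-in module created on the fly instead of a real pip package:
demo = types.ModuleType('my_team_validators')
demo.get_validators = lambda: {'not_bool': object}
sys.modules['my_team_validators'] = demo

print(load_validator_module('my_team_validators'))  # {'not_bool': <class 'object'>}
```

Because `importlib.import_module` only finds installed (or registered) modules, this keeps the loading explicit and avoids scanning arbitrary file paths, in line with the security concern above.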
I definitely like the idea of a CLI flag to enforce it. As for a pip package, I can see the benefit there for allowing "user contributed" validators that lots of other users would want to use. I still believe there's something to be gained by having specific validators (ie, the validator that I'm thinking about writing would almost certainly only be useful to me and my team).
Do you have any problems with me working on a PR to implement the above proposed solution?
Go for it! :)
Looking at the above, how will you determine what is loadable code and what isn't? I don't mind having a reserved key to determine this, something like yamale_validators or similar. Its value would be an array of validators; we'd want to support adding more than one validator.
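Under that reserved-key idea, the schema might look something like the sketch below. The key name `yamale_validators` and the per-entry layout are placeholders, not a settled design:

```yaml
yamale_validators:
  - tag: not_bool
    code: |
      def _is_valid(self, value):
          return not isinstance(value, bool)
  - tag: env_state
    code: |
      def _is_valid(self, value):
          import os
          return value == os.environ.get('STATE')

state_name: env_state()
```

Keeping all validators under one reserved top-level key makes it easy to strip them out before validating the rest of the schema, and naturally supports more than one validator.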
Some things to think about:
- What happens when the user mistakenly adds two validators with the same tag? Is that an error or a warning?
- What happens when the user attempts to override an existing validator?
- Will this work in conjunction with loading a separate custom validator via code?
- Keep in mind the other implementation in case we'd like to do that in the future as well. I'd hate to totally rewrite this to fit the other implementation.
- Tests, tests, tests... someone else will be maintaining your code.
Please reach out with any questions!