Add support for testing schema changes
This proposes a new feature to test the schema in scan YAML files. Please give your input as comments if you see alternatives or other use cases that should be considered.
To test the schema during scans, the schema and column type configurations can be used in the scan YAML:
table_name: orders
schema: required
columns:
  id:
    type: int64
    ...
  name:
    type: character varying
    ...
  location: ...
  ...
There are 4 options for the schema configuration property:
schema: required implies that each specified column has to be present. If any specified column is missing,
a single schema test fails and reports all of the missing columns. Other, non-specified columns are allowed
to be present in the schema.
schema: exact implies that each specified column has to be present and no other columns are allowed. If
there is any mismatch between the columns specified in the scan YAML and the observed columns, a single
test fails and lists all mismatches. A mismatch is either a specified column that is missing or a non-specified
column that is present.
schema: previous_required implies that all columns present during the previous scan must still be present. Note
that this test can only be executed from the second scan onwards that is connected to a Soda Cloud account, because the first scan has no previous schema to compare against.
schema: previous_exact implies that the schema must be exactly the same as the previous scan.
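To make the difference between required and exact concrete, here is a minimal sketch of the comparison, not the actual implementation. The function name schema_mismatches and the message wording are hypothetical. The previous_* variants could reuse the same logic by passing the columns of the previous scan as the expected list.

```python
from typing import List


def schema_mismatches(mode: str, expected: List[str], observed: List[str]) -> List[str]:
    """Return all mismatches at once; an empty list means the schema test passes.

    Hypothetical helper for illustration only.
    """
    if mode not in ("required", "exact"):
        raise ValueError(f"unsupported schema mode: {mode!r}")
    expected_set = set(expected)
    observed_set = set(observed)
    # Both modes fail on expected columns that are missing.
    mismatches = [f"missing column: {c}" for c in expected if c not in observed_set]
    # Only 'exact' also fails on columns that were not specified.
    if mode == "exact":
        mismatches += [f"unexpected column: {c}" for c in observed if c not in expected_set]
    return mismatches


observed = ["id", "name", "created_at"]
print(schema_mismatches("required", ["id", "name", "location"], observed))
print(schema_mismatches("exact", ["id", "name", "location"], observed))
```

In both calls the test fails once and reports every mismatch together, matching the "single test" behaviour described above.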
And there is the type configuration on the column:
If a type is specified on a column, it implies that the column must be present and that the observed type must
match the given type.
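A rough sketch of what the type check could look like, under assumptions: the helper name type_mismatches is hypothetical, and comparing type names case-insensitively is a guess that the proposal does not specify.

```python
from typing import Dict, List


def type_mismatches(expected_types: Dict[str, str], observed_types: Dict[str, str]) -> List[str]:
    """Check that each column with a configured type exists and has that type.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    problems = []
    for column, expected in expected_types.items():
        if column not in observed_types:
            # A configured type implies the column must be present.
            problems.append(f"{column}: column is missing")
        elif observed_types[column].lower() != expected.lower():
            problems.append(
                f"{column}: expected type {expected}, observed {observed_types[column]}"
            )
    return problems


print(type_mismatches(
    {"id": "int64", "name": "character varying"},
    {"id": "int64", "name": "text"},
))
```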
@tombaeyens do the .._exact semantics check the order of the columns as well? We might need to protect the data against table alters (especially if this table is dumped to S3 and column order plays a critical role in reading the data from there).
Good point, @mmigdiso! We could add an ordered keyword to the value as well and use a space-separated syntax like this: schema: [previous] (required|exact) [ordered]
wdyt?
Would these examples be clear on what they mean?
schema: required
schema: previous required
schema: exact
schema: exact ordered
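For reference, a minimal sketch of how a value in the proposed [previous] (required|exact) [ordered] form could be parsed. SchemaSpec and parse_schema_value are hypothetical names, and the regular expression is just one reading of the grammar above.

```python
import re
from typing import NamedTuple


class SchemaSpec(NamedTuple):
    previous: bool  # compare against the previous scan instead of the scan YAML
    mode: str       # "required" or "exact"
    ordered: bool   # also check column order


# One reading of the proposed grammar: [previous] (required|exact) [ordered]
_PATTERN = re.compile(r"^(?:(previous)\s+)?(required|exact)(?:\s+(ordered))?$")


def parse_schema_value(value: str) -> SchemaSpec:
    match = _PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"invalid schema value: {value!r}")
    previous, mode, ordered = match.groups()
    return SchemaSpec(previous is not None, mode, ordered is not None)


for example in ("required", "previous required", "exact", "exact ordered"):
    print(example, "->", parse_schema_value(example))
```

Note that this grammar also accepts required ordered; whether that combination should be meaningful is still open.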