sqlmesh
sqlmesh copied to clipboard
Seed column & model column mismatch due to normalization
To reproduce the bug:
- Create new SQLMesh project using
sqlmesh init postgres --template empty- Postgres is the default dialect because its normalization strategy isLOWERCASE, i.e. it treats quoted identifiers as case-sensitive, and lowercases unquoted ones. - Create the following seed model:
MODEL (
name sqlmesh_example.demo_seed,
kind SEED (
path '../seeds/demo_seed.csv'
),
columns(
"camelCaseColumn" TEXT,
),
);
- Create the following seed file under
seeds/, nameddemo_seed.csv:
"camelCaseColumn"
foo
- Run
sqlmesh plan.
This currently produces the following error:
File "sqlmesh/core/model/definition.py", line 1182, in render_seed
df.loc[:, string_columns] = df[string_columns].mask(
File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3899, in __getitem__
indexer = self.columns._get_indexer_strict(key, "columns")[1]
File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6115, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['camelCaseColumn'], dtype='object')] are in the [columns]"
The reason this happens is because we normalize the seed file's column by lowercasing it, whereas the column property lists it with quotes, making it case-sensitive. We're essentially always treating CSV columns as case-insensitive, since there's no way to quote them and preserve those quotes in the dataframe returned by pd.read_csv.
Some ideas to fix this:
- Always treat CSV columns as case-sensitive. This may be problematic because it's a constraint that propagates downstream, thus requiring quotes everywhere. One way a user could overcome this would be to make a new model that acts as a dependency inversion layer for that seed, e.g.
SELECT "camelCaseColumn" AS camelCaseColumn FROM sqlmesh_example.demo_seed. Then, the column would not require quotes downstream from that "staging" model. - Continue normalizing the CSV columns, but treat the names enumerated in the
columnsmodel property as the source of truth. So for the example above, we'd rename the single dataframe column intocamelCaseColumnto match the corresponding column in the seed'scolumns_to_types.