Seed column & model column mismatch due to normalization

Open georgesittas opened this issue 1 year ago • 0 comments

To reproduce the bug:

Create new SQLMesh project using sqlmesh init postgres --template empty - Postgres is the default dialect because its normalization strategy is LOWERCASE, i.e. it treats quoted identifiers as case-sensitive, and lowercases unquoted ones.
Create the following seed model:

MODEL (
    name sqlmesh_example.demo_seed,
    kind SEED (
        path '../seeds/demo_seed.csv'
    ),
    columns(
        "camelCaseColumn" TEXT,
    ),
);

Create the following seed file under seeds/, named demo_seed.csv:

"camelCaseColumn"
foo

Run sqlmesh plan.

This currently produces the following error:

  File "sqlmesh/core/model/definition.py", line 1182, in render_seed
    df.loc[:, string_columns] = df[string_columns].mask(
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3899, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6115, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "sqlmesh/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['camelCaseColumn'], dtype='object')] are in the [columns]"

The reason this happens is because we normalize the seed file's column by lowercasing it, whereas the column property lists it with quotes, making it case-sensitive. We're essentially always treating CSV columns as case-insensitive, since there's no way to quote them and preserve those quotes in the dataframe returned by pd.read_csv.

Some ideas to fix this:

Always treat CSV columns as case-sensitive. This may be problematic because it's a constraint that propagates downstream, thus requiring quotes everywhere. One way a user could overcome this would be to make a new model that acts as a dependency inversion layer for that seed, e.g. SELECT "camelCaseColumn" AS camelCaseColumn FROM sqlmesh_example.demo_seed. Then, the column would not require quotes downstream from that "staging" model.
Continue normalizing the CSV columns, but treat the names enumerated in the columns model property as the source of truth. So for the example above, we'd rename the single dataframe column into camelCaseColumn to match the corresponding column in the seed's columns_to_types.

May 29 '24 19:05 georgesittas