reconcile_schemas: Parameter "schema" column names should be lowercased before column differential check

Open gavinpaes opened this issue 5 years ago • 0 comments

operators/s3_to_redshift_operator.py (Lines 189-191)

pg_query = \
            """
            SELECT column_name, udt_name
            FROM information_schema.columns
            WHERE table_schema = '{0}' AND table_name = '{1}';
            """.format(self.redshift_schema, self.table)
pg_schema = dict(pg_hook.get_records(pg_query))
incoming_keys = [column['name'] for column in schema]
diff = list(set(incoming_keys) - set(pg_schema.keys()))

In above snippet:

If "schema" column name contains any uppercase character, the column differential (diff) will erroneously be a non-empty set. This will in turn cause logic to attempt to insert a column that is already present in created table.

Example

Assume schema = {"name": "ColumnName", "type": _ }

pg_query will report column_name == "columnname" (automatically lowercased by redshift) but incoming keys will leverage column['name'] == "ColumnName" so:

In [1]: diff =  list(set(["ColumnName"]) - set(["columnname"]))
In [2]: diff
Out[2]: ['ColumnName']

This will cause subsequent logic to try to insert a new column called 'ColumnName' which will fail since 'columnname' already exists in created table.

May 22 '20 00:05 gavinpaes