superset icon indicating copy to clipboard operation
superset copied to clipboard

fix: presto table schema expansion when nested rows contain columns of `map` and/or `array` types

Open brouberol opened this issue 2 years ago • 2 comments

SUMMARY

When the PRESTO_EXPAND_DATA feature flag was set to True, we've noticed that some of our Presto tables were able to be previewed, while some other failed, with the following traceback

2024-01-30 14:08:00,598:ERROR:flask_appbuilder.api:list index out of range
Traceback (most recent call last):
  File "//path/to/virtualenv/site-packages/flask_appbuilder/api/__init__.py", line 110, in wraps
    return f(self, *args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "//path/to/virtualenv/site-packages/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/utils/core.py", line 1454, in time_function
    response = func(*args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/databases/api.py", line 740, in table_metadata
    table_info = get_table_metadata(database, table_name, schema_name)
  File "//path/to/virtualenv/site-packages/superset/databases/utils.py", line 67, in get_table_metadata
    columns = database.get_columns(table_name, schema_name)
  File "//path/to/virtualenv/site-packages/superset/models/core.py", line 847, in get_columns
    return self.db_engine_spec.get_columns(
  File "//path/to/virtualenv/site-packages/superset/db_engine_specs/presto.py", line 1008, in get_columns
    cls._parse_structural_column(column.Column, column.Type, result)
  File "//path/to/virtualenv/site-packages/superset/db_engine_specs/presto.py", line 922, in _parse_structural_column
    column_spec = cls.get_column_spec(field_info[1])
IndexError: list index out of range

The discriminating factor turned out to be the presence of map fields in nested rows, causing the parent_data_type argument passed to PrestoEngineSpec._parse_structural_column classmethod to be mis-parsed, as demonstrated:

(Pdb) w
...
-> spec._parse_structural_column("test_column", parent_data_type, accumulator)
> /Users/br/wmf/apache/superset/superset/db_engine_specs/presto.py(905)_parse_structural_column()
-> stack: list[tuple[str, str]] = []
(Pdb) parent_data_type
'row("protocol" varchar, "request_headers" map(varchar, varchar))'
(Pdb) data_types
['test_column row',
 '"protocol" varchar,
 "request_headers" map',
 'varchar, varchar))']

The "request_headers" map(varchar, varchar) token was parsed into ['"request_headers" map', 'varchar, varchar))'], which later on breaks on ['varchar, varchar))'][1], as demonstrated in the previous traceback.

I've decided to fix this bug by removing the key/value types of the map itself from the parent_data_type string, as assuming they were parsed correctly, the types do not seem to be used by SQLAlchemy anyway:

(Pdb) cls.get_column_spec('map')
ColumnSpec(sqla_type=Map(), generic_type=<GenericDataType.STRING: 1>, is_dttm=False, python_date_format=None)
(Pdb) cls.get_column_spec('map(varchar, varchar)')
ColumnSpec(sqla_type=Map(), generic_type=<GenericDataType.STRING: 1>, is_dttm=False, python_date_format=None)

This way, the parent_data_type string 'row("protocol" varchar, "request_headers" map(varchar, varchar))' becomes 'row("protocol" varchar, "request_headers" map)', which gets successfully parsed as ['test_column row', '"protocol" varchar, "request_headers" map)']`.

Note: We also had the same issue with array types as well. A Presto array(string) column corresponds to a sqlalchemy.types.ARRAY type, so we can drop the (string) part of the column type.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before After
image Screenshot 2024-01-30 at 17 16 50

TESTING INSTRUCTIONS

Preview the schema of a Presto table contained nested rows with the PRESTO_EXPAND_DATA feature flag enabled.

ADDITIONAL INFORMATION

  • [ ] Has associated issue:
  • [ ] Required feature flags:
  • [ ] Changes UI
  • [ ] Includes DB Migration (follow approval process in SIP-59)
    • [ ] Migration is atomic, supports rollback & is backwards-compatible
    • [ ] Confirm DB migration upgrade and downgrade tested
    • [ ] Runtime estimates and downtime expectations provided
  • [ ] Introduces new feature or API
  • [ ] Removes existing feature or API

brouberol avatar Jan 30 '24 16:01 brouberol

Thanks @brouberol for the PR. I've added a few reviewers who are likely familiar with this feature.

john-bodley avatar Jan 30 '24 18:01 john-bodley

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 65.08%. Comparing base (76d897e) to head (ff43648). Report is 467 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #26892      +/-   ##
==========================================
+ Coverage   60.48%   65.08%   +4.59%     
==========================================
  Files        1931      526    -1405     
  Lines       76236    37907   -38329     
  Branches     8568        0    -8568     
==========================================
- Hits        46114    24672   -21442     
+ Misses      28017    13235   -14782     
+ Partials     2105        0    -2105     
Flag Coverage Δ
hive 49.09% <0.00%> (-0.07%) :arrow_down:
javascript ?
presto 53.66% <0.00%> (-0.15%) :arrow_down:
python 65.08% <100.00%> (+1.59%) :arrow_up:
unit 59.99% <100.00%> (+2.36%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jan 30 '24 18:01 codecov[bot]

Happy to review if/when this PR gets rebased

mistercrunch avatar Jul 22 '24 17:07 mistercrunch