fix: presto table schema expansion when nested rows contain columns of `map` and/or `array` types
SUMMARY
When the PRESTO_EXPAND_DATA feature flag was set to True, we've noticed that some of our Presto tables were able to be previewed, while some other failed, with the following traceback
2024-01-30 14:08:00,598:ERROR:flask_appbuilder.api:list index out of range
Traceback (most recent call last):
File "//path/to/virtualenv/site-packages/flask_appbuilder/api/__init__.py", line 110, in wraps
return f(self, *args, **kwargs)
File "//path/to/virtualenv/site-packages/superset/views/base_api.py", line 127, in wraps
raise ex
File "//path/to/virtualenv/site-packages/superset/views/base_api.py", line 121, in wraps
duration, response = time_function(f, self, *args, **kwargs)
File "//path/to/virtualenv/site-packages/superset/utils/core.py", line 1454, in time_function
response = func(*args, **kwargs)
File "//path/to/virtualenv/site-packages/superset/utils/log.py", line 255, in wrapper
value = f(*args, **kwargs)
File "//path/to/virtualenv/site-packages/superset/databases/api.py", line 740, in table_metadata
table_info = get_table_metadata(database, table_name, schema_name)
File "//path/to/virtualenv/site-packages/superset/databases/utils.py", line 67, in get_table_metadata
columns = database.get_columns(table_name, schema_name)
File "//path/to/virtualenv/site-packages/superset/models/core.py", line 847, in get_columns
return self.db_engine_spec.get_columns(
File "//path/to/virtualenv/site-packages/superset/db_engine_specs/presto.py", line 1008, in get_columns
cls._parse_structural_column(column.Column, column.Type, result)
File "//path/to/virtualenv/site-packages/superset/db_engine_specs/presto.py", line 922, in _parse_structural_column
column_spec = cls.get_column_spec(field_info[1])
IndexError: list index out of range
The discriminating factor turned out to be the presence of map fields in nested rows, causing the parent_data_type argument passed to PrestoEngineSpec._parse_structural_column classmethod to be mis-parsed, as demonstrated:
(Pdb) w
...
-> spec._parse_structural_column("test_column", parent_data_type, accumulator)
> /Users/br/wmf/apache/superset/superset/db_engine_specs/presto.py(905)_parse_structural_column()
-> stack: list[tuple[str, str]] = []
(Pdb) parent_data_type
'row("protocol" varchar, "request_headers" map(varchar, varchar))'
(Pdb) data_types
['test_column row',
'"protocol" varchar,
"request_headers" map',
'varchar, varchar))']
The "request_headers" map(varchar, varchar) token was parsed into ['"request_headers" map', 'varchar, varchar))'], which later on breaks on ['varchar, varchar))'][1], as demonstrated in the previous traceback.
I've decided to fix this bug by removing the key/value types of the map itself from the parent_data_type string, as assuming they were parsed correctly, the types do not seem to be used by SQLAlchemy anyway:
(Pdb) cls.get_column_spec('map')
ColumnSpec(sqla_type=Map(), generic_type=<GenericDataType.STRING: 1>, is_dttm=False, python_date_format=None)
(Pdb) cls.get_column_spec('map(varchar, varchar)')
ColumnSpec(sqla_type=Map(), generic_type=<GenericDataType.STRING: 1>, is_dttm=False, python_date_format=None)
This way, the parent_data_type string 'row("protocol" varchar, "request_headers" map(varchar, varchar))' becomes 'row("protocol" varchar, "request_headers" map)', which gets successfully parsed as ['test_column row', '"protocol" varchar, "request_headers" map)']`.
Note: We also had the same issue with array types as well. A Presto array(string) column corresponds to a sqlalchemy.types.ARRAY type, so we can drop the (string) part of the column type.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
| Before | After |
|---|---|
TESTING INSTRUCTIONS
Preview the schema of a Presto table contained nested rows with the PRESTO_EXPAND_DATA feature flag enabled.
ADDITIONAL INFORMATION
- [ ] Has associated issue:
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in SIP-59)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [ ] Confirm DB migration upgrade and downgrade tested
- [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
Thanks @brouberol for the PR. I've added a few reviewers who are likely familiar with this feature.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 65.08%. Comparing base (
76d897e) to head (ff43648). Report is 467 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #26892 +/- ##
==========================================
+ Coverage 60.48% 65.08% +4.59%
==========================================
Files 1931 526 -1405
Lines 76236 37907 -38329
Branches 8568 0 -8568
==========================================
- Hits 46114 24672 -21442
+ Misses 28017 13235 -14782
+ Partials 2105 0 -2105
| Flag | Coverage Δ | |
|---|---|---|
| hive | 49.09% <0.00%> (-0.07%) |
:arrow_down: |
| javascript | ? |
|
| presto | 53.66% <0.00%> (-0.15%) |
:arrow_down: |
| python | 65.08% <100.00%> (+1.59%) |
:arrow_up: |
| unit | 59.99% <100.00%> (+2.36%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Happy to review if/when this PR gets rebased