Expanding nested star selects gives an error in Bigquery syntax
Before you file an issue
- Make sure you specify the "read" dialect eg.
parse_one(sql, read="spark"): bigquery dialect - Make sure you specify the "write" dialect eg.
ast.sql(dialect="duckdb"): Not applicable - Check if the issue still exists on main: Couldn't find anything related
Fully reproducible code snippet
sql_string = """
SELECT
foo.bar.*
FROM
foo
"""
import sqlglot
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.schema import ensure_schema
ast = sqlglot.parse_one(sql_string, dialect="bigquery")
schema = ensure_schema(None, dialect="bigquery")
qualified = qualify_columns(
ast,
schema,
expand_alias_refs=True,
expand_stars=True,
infer_schema=None,
)
The qualify function throws an error:
sqlglot.errors.OptimizeError: Unknown table: bar
Official Documentation Bigquery docs
Also consider applicability to nested structs
SELECT
user.id,
struct(
user.id,
user.name,
user.email,
user.title,
user.harness_department_c,
user.role_level_c,
user.current_role_start_date_c,
user.geo_region_c,
user.geo_subregion_c,
user.sales_segment_c,
struct(manager.id, manager.name, manager.email) AS manager,
struct(se.id, se.name, se.email) AS se,
struct(tdr.id, tdr.name, tdr.email) AS tdr,
struct(tds.id, tds.name, tds.email) AS tds
)::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
ON user.tds_c = tds.id
select a.emp.manager.* from this.table as a
Interesting, so it completely flattens structs. FYI @VaggelisD, an interesting edge case here is when two struct fields on different levels have the same names. BigQuery will give them names such as lvl1.foo, so if it's the only dialect that supports this struct/star explosion we could try to create matching aliases for the expanded projections.
There are some quirky edge cases for sure, and they're not all documented properly unfortunately (from a quick look). For the example you mentioned:
select
a, b, c.* from
(select
5 as a,
6 as b,
(select as struct
7 as a,
8 as b
) as c
)
gives an unnested table with fields
(a, b, a_1, b_1)
whereas selecting as a struct
select as struct
a, b, c.* from
(select
5 as a,
6 as b,
(select as struct
7 as a,
8 as b
) as c
)
leads to an unnested table with field names
(a, b, _field_3, _field_4)
ok I see, I'm not sure if it's worth preserving all of that then. Perhaps our usual _col_i naming scheme will suffice?
I agree, fully replicating the logic won't be necessary as with proper aliasing these edge cases will not occur
Yep 👍. FYI this is a WIP on our end, should have a fix soon.
Thanks for picking this up so quickly, I really appreciate it!
Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!
Another consideration that causes a schema warning in sqlmesh disabling optimization:
A table is implicitly a struct when referenced directly in a projection.
select
users.*,
activity -- a table is implicitly a struct when referenced directly
from users
left join activity
using (user_id)
So if activity is:
CREATE TABLE activity (
id string,
user_id string,
event string,
ts timestamp
);
The referencing it directly, the output type is struct<id string, user_id string, event string, ts timestamp> as you'd guess. Not sure if this is related or not to the same codepath. Otherwise we can break this out.
cc @georgesittas
So I think the implicit unnesting is probably trickier, and my hunch is that Vaggelis' PR won't solve it. Though I'd double check to be sure.
Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!
I think so! Can you share a small example to make sure I understood this correctly?
-- Given this is your upstream model called `user`
SELECT
user.id,
struct(
user.id,
user.name,
user.email,
user.title,
user.harness_department_c,
user.role_level_c,
user.current_role_start_date_c,
user.geo_region_c,
user.geo_subregion_c,
user.sales_segment_c,
struct(manager.id, manager.name, manager.email) AS manager,
struct(se.id, se.name, se.email) AS se,
struct(tdr.id, tdr.name, tdr.email) AS tdr,
struct(tds.id, tds.name, tds.email) AS tds
)::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
ON user.tds_c = tds.id;
-- This will cause the optimizer to fail / warn, even if upstream schema was known fully
SELECT emp.* FROM user;
-- Ditto
SELECT emp.manager.* FROM user;
You can repro it with any struct * expansion in a downstream model
Yep, these should work after we fix this issue :)