sqlglot Expanding nested star selects gives an error in Bigquery syntax

Before you file an issue

Make sure you specify the "read" dialect eg. parse_one(sql, read="spark"): bigquery dialect
Make sure you specify the "write" dialect eg. ast.sql(dialect="duckdb"): Not applicable
Check if the issue still exists on main: Couldn't find anything related

Fully reproducible code snippet

sql_string = """
SELECT
  foo.bar.*
FROM
  foo
"""

import sqlglot
from sqlglot.optimizer.qualify_columns import qualify_columns
from sqlglot.schema import ensure_schema

ast = sqlglot.parse_one(sql_string, dialect="bigquery")
schema = ensure_schema(None, dialect="bigquery")

qualified = qualify_columns(
  ast,
  schema,
  expand_alias_refs=True,
  expand_stars=True,
  infer_schema=None,
)

The qualify function throws an error: sqlglot.errors.OptimizeError: Unknown table: bar

Official Documentation Bigquery docs

May 15 '24 14:05 Bladieblah

Also consider applicability to nested structs

SELECT
    user.id,
    struct(
        user.id,
        user.name,
        user.email,
        user.title,
        user.harness_department_c,
        user.role_level_c,
        user.current_role_start_date_c,
        user.geo_region_c,
        user.geo_subregion_c,
        user.sales_segment_c,
        struct(manager.id, manager.name, manager.email) AS manager,
        struct(se.id, se.name, se.email) AS se,
        struct(tdr.id, tdr.name, tdr.email) AS tdr,
        struct(tds.id, tds.name, tds.email) AS tds
    )::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
    ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
    ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
    ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
    ON user.tds_c = tds.id

select a.emp.manager.* from this.table as a

May 17 '24 05:05 z3z1ma

Interesting, so it completely flattens structs. FYI @VaggelisD, an interesting edge case here is when two struct fields on different levels have the same names. BigQuery will give them names such as lvl1.foo, so if it's the only dialect that supports this struct/star explosion we could try to create matching aliases for the expanded projections.

May 17 '24 13:05 georgesittas

There are some quirky edge cases for sure, and they're not all documented properly unfortunately (from a quick look). For the example you mentioned:

select
  a, b, c.* from 
(select
  5 as a,
  6 as b,
  (select as struct 
    7 as a,
    8 as b
  ) as c
)

gives an unnested table with fields (a, b, a_1, b_1)

whereas selecting as a struct

select as struct 
  a, b, c.* from 
(select
  5 as a,
  6 as b,
  (select as struct 
    7 as a,
    8 as b
  ) as c
)

leads to an unnested table with field names (a, b, _field_3, _field_4)

May 17 '24 21:05 Bladieblah

ok I see, I'm not sure if it's worth preserving all of that then. Perhaps our usual _col_i naming scheme will suffice?

May 17 '24 23:05 georgesittas

I agree, fully replicating the logic won't be necessary as with proper aliasing these edge cases will not occur

May 22 '24 12:05 Bladieblah

Yep 👍. FYI this is a WIP on our end, should have a fix soon.

May 22 '24 12:05 georgesittas

Thanks for picking this up so quickly, I really appreciate it!

May 22 '24 16:05 Bladieblah

Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!

Another consideration that causes a schema warning in sqlmesh disabling optimization:

A table is implicitly a struct when referenced directly in a projection.

select
  users.*,
  activity -- a table is implicitly a struct when referenced directly
from users
left join activity
  using (user_id)

So if activity is:

CREATE TABLE activity (
  id string,
  user_id string,
  event string,
  ts timestamp
);

The referencing it directly, the output type is struct<id string, user_id string, event string, ts timestamp> as you'd guess. Not sure if this is related or not to the same codepath. Otherwise we can break this out.

cc @georgesittas

May 22 '24 16:05 z3z1ma

So I think the implicit unnesting is probably trickier, and my hunch is that Vaggelis' PR won't solve it. Though I'd double check to be sure.

Currently trying to unnest a struct with * causes an warning in sqlmesh that prevents optimization due to missing schema for table I think. So for clarity, does this change in sqlglot also fix that sqlmesh behavior? If so, thats great!

I think so! Can you share a small example to make sure I understood this correctly?

May 22 '24 17:05 georgesittas

-- Given this is your upstream model called `user`
SELECT
    user.id,
    struct(
        user.id,
        user.name,
        user.email,
        user.title,
        user.harness_department_c,
        user.role_level_c,
        user.current_role_start_date_c,
        user.geo_region_c,
        user.geo_subregion_c,
        user.sales_segment_c,
        struct(manager.id, manager.name, manager.email) AS manager,
        struct(se.id, se.name, se.email) AS se,
        struct(tdr.id, tdr.name, tdr.email) AS tdr,
        struct(tds.id, tds.name, tds.email) AS tds
    )::STRUCT<id STRING, name STRING, email STRING, title STRING, harness_department_c STRING, role_level_c STRING, current_role_start_date_c DATE, geo_region_c STRING, geo_subregion_c STRING, sales_segment_c STRING, manager STRUCT<id STRING, name STRING, email STRING>, se STRUCT<id STRING, name STRING, email STRING>, tdr STRUCT<id STRING, name STRING, email STRING>, tds STRUCT<id STRING, name STRING, email STRING>> AS emp
FROM salesforce_v4.user
LEFT JOIN salesforce_v4.user AS manager
    ON user.manager_id = manager.id
LEFT JOIN salesforce_v4.user AS se
    ON user.se_c = se.id
LEFT JOIN salesforce_v4.user AS tdr
    ON user.tdr_c = tdr.id
LEFT JOIN salesforce_v4.user AS tds
    ON user.tds_c = tds.id;
    
    
-- This will cause the optimizer to fail / warn, even if upstream schema was known fully
SELECT emp.* FROM user;

-- Ditto
SELECT emp.manager.* FROM user;

You can repro it with any struct * expansion in a downstream model

May 23 '24 21:05 z3z1ma

Yep, these should work after we fix this issue :)

May 23 '24 21:05 georgesittas