sqlglot icon indicating copy to clipboard operation
sqlglot copied to clipboard

fix(starrocks): exp.Unnest transpilation

Open hellozepp opened this issue 1 year ago • 4 comments

Fixes #3962

This PR adds support for the following:

  • Default initialize exp.UNNEST to include "unnest" as the default table alias if not specified
  • Added expression type of exp.Inline
  • use arrays_zip to merge multiple Lateral views
  • Added some ut to validate unnest transpile

hellozepp avatar Aug 24 '24 19:08 hellozepp

I tried to add the logic of setting default value when there is no table alias, PTAL @georgesittas @VaggelisD

I need to transpile the starrocks sql:

SELECT student, score, unnest FROM tests CROSS JOIN LATERAL UNNEST(scores)

Expected spark sql:

SELECT student, score, unnest FROM tests LATERAL VIEW EXPLODE(scores) unnest AS unnest

hellozepp avatar Aug 24 '24 20:08 hellozepp

I found that when unnest has multiple outputs, it will be transpiled into multiple Lateral view statements (actually incorrect), as shown in the following SQL of starrocks:

SELECT id, t.type, t.scores FROM example_table CROSS JOIN LATERAL unnest(split(type, ";"), scores) AS t(type,scores)

Transpiled into spark, the output is:

SELECT id, t.type, t.scores FROM example_table LATERAL VIEW EXPLODE(SPLIT(type, CONCAT('\\Q', ';'))) t AS type LATERAL VIEW EXPLODE(scores) t AS scores

Incorrect reason: The output of unnest is the combined value of multiple columns, and LATERAL VIEW will be expanded multiple times and cross joined. We can use arrays_zip to merge multiple Lateral views.

In fact, our users often use the syntax combination LATERAL VIEW INLINE(arrays_zip(col1, col2)).

table function inline doc as following:

databricks: https://docs.databricks.com/en/sql/language-manual/functions/inline.html spark: https://spark.apache.org/docs/latest/api/sql/#inline hive:https://stackoverflow.com/questions/25088488/use-inlinearraystruct-struct-in-hive

Expected output:

SELECT id, t.type, t.score FROM example_table LATERAL VIEW INLINE(ARRAYS_ZIP(SPLIT(type, CONCAT('\\Q', ';')), scores)) t AS type, score

But function arrays_zip may not be supported by some engines, such as hive. I added a parameter unnest_using_arrays_zip to avoid hive being affected.

I added a commit. Will be open to any suggestions =)

hellozepp avatar Aug 25 '24 10:08 hellozepp

@VaggelisD Added new commit, PTAL

hellozepp avatar Aug 26 '24 14:08 hellozepp

@VaggelisD Good evening, I made some change, please take a look at it when you have time

hellozepp avatar Aug 26 '24 16:08 hellozepp

@georgesittas @VaggelisD Added new commit, PTAL

hellozepp avatar Aug 27 '24 11:08 hellozepp

@georgesittas Agree. Thanks for the previous advice, it helped me a lot =)

hellozepp avatar Aug 29 '24 02:08 hellozepp

Sorry, I cleaned up my sqlglot repository because it was a mess. Wait a minute and I'll make another commit. There won't be any new code since our last commit.

hellozepp avatar Aug 29 '24 07:08 hellozepp