SNOW-643388: Are CTEs supported?
What is the current behavior?
Using the same dataframe in a self-join results in generated SQL that has multiple copies of the same sub-query.
What is the desired behavior?
It seems beneficial to have such dataframe operations use CTEs when generating the SQL.
How would this improve snowflake-snowpark-python?
More efficient query execution.
References, Other Background
I see that the dataframe api offers the cache_result() function. However, that makes use of temp tables, and requires the end-user to explicitly call cache_result() (rather than the library figuring out when to use CTEs).
Hi Ken, thanks for the feedback. Today extracting common subqueries into CTE is not supported. This is challenging to get right and requires some designs/thoughts, but it is definitely a nice optimization.
Thanks for the reply. Is a simpler first step to have an API that would allow the user to specify which .select() calls should be treated as CTEs?
Unfortunately I don't think we have that yet. It might be useful to look into cache_result if certain operations need to be executed multiple times.
CTE is supported now by automatically converting duplicate subqueries to CTEs. You can enable this feature by setting session.cte_optimization_enabled = True.