Performance Improvement Preview: Duplicate Subquery Elimination

Open sfc-gh-jdu opened this issue 1 year ago • 0 comments

Hi all,

We have an exciting query performance optimization released with Snowpark Python v1.15.0, and we're looking for users to test it with production workloads. This optimization automatically converts duplicate subqueries to CTEs (common table expressions), which reduces both query compilation and computation time. We encourage you to try out this experimental feature if you

use the same Snowpark DataFrame multiple times to build another DataFrame in your workloads, e.g.

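from snowflake.snowpark.functions import col, lit

df = ...  # any existing Snowpark DataFrame
df1 = df.filter(col("a") == 1)     # first use of df
df2 = df.with_column("c", lit(1))  # second use of df
df3 = df1.join(df2)                # df's query now appears twice in df3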

or if you have previously used df.cache_result() manually to improve performance by saving the intermediate result.
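To make "duplicate subquery to CTE" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Snowflake. The table t, the base query, and the join condition are assumptions made up for illustration; the SQL that Snowpark actually generates will look different. It only shows that repeating a subquery and naming it once in a WITH clause give the same result.

```python
import sqlite3

# In-memory SQLite database as a stand-in; table `t` is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (1, 20), (2, 30)])

# Plays the role of the shared DataFrame `df` (assumed base query).
base = "SELECT a, b FROM t WHERE b >= 10"

# Without the optimization: the base query text is embedded twice.
duplicated_sql = f"""
SELECT l.a, l.b, r.c
FROM (SELECT * FROM ({base}) WHERE a = 1) AS l
JOIN (SELECT a, b, 1 AS c FROM ({base})) AS r
  ON l.a = r.a AND l.b = r.b
ORDER BY l.b
"""

# With the optimization: the base query is defined once as a CTE
# and referenced by name from both sides of the join.
cte_sql = f"""
WITH shared AS ({base})
SELECT l.a, l.b, r.c
FROM (SELECT * FROM shared WHERE a = 1) AS l
JOIN (SELECT a, b, 1 AS c FROM shared) AS r
  ON l.a = r.a AND l.b = r.b
ORDER BY l.b
"""

duplicated_rows = conn.execute(duplicated_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(duplicated_rows == cte_rows)  # both forms return the same rows
```

The CTE form is shorter to compile because the shared query text appears once, and it gives the engine the chance to evaluate that query a single time.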

Feel free to try this optimization by setting

session.cte_optimization_enabled = True

at the beginning of your code (cache_result() is then no longer needed), and observe how performance changes.
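One way to compare performance is to run the same workload with the flag off and on. The helper below is only a sketch: run_with_cte_setting and build_df are hypothetical names, and build_df is assumed to be a function you write that takes the session and returns the DataFrame under test. Warehouse and result caching can skew a single run, so treat the timings as rough.

```python
import time


def run_with_cte_setting(session, build_df, enabled):
    """Build and collect a DataFrame with the CTE optimization on or off.

    `build_df` is a hypothetical callable: it takes the session and
    returns the Snowpark DataFrame whose performance you want to check.
    """
    previous = session.cte_optimization_enabled
    session.cte_optimization_enabled = enabled
    try:
        start = time.perf_counter()
        rows = build_df(session).collect()  # triggers query execution
        elapsed = time.perf_counter() - start
    finally:
        # Restore the original setting even if the query fails.
        session.cte_optimization_enabled = previous
    return rows, elapsed


# Usage sketch (needs a live Snowpark session, so it is commented out):
# rows_off, t_off = run_with_cte_setting(session, build_df, enabled=False)
# rows_on, t_on = run_with_cte_setting(session, build_df, enabled=True)
# print(f"off: {t_off:.2f}s  on: {t_on:.2f}s")
```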

If you have any questions, feel free to reach out to me by email or comment under this issue directly. Any input is appreciated!

May 16 '24 22:05 sfc-gh-jdu