SNOW-654710: Drop cached temporary tables that are no longer being referenced
What is the current behavior?
cache_result does not clean up a temporary table once it is no longer referenced. This incurs unnecessary storage credits for the rest of the session whenever a result is re-cached, which is a common thing to do in an interactive session.
>>> dff = session.create_dataframe([1,2,3])
>>> dff.cache_result().explain()
---------DATAFRAME EXECUTION PLAN----------
Query List:
1.
SELECT * FROM (SNOWPARK_TEMP_TABLE_X3FCJ1U38A)
...
--------------------------------------------
>>> dff.cache_result().explain()
---------DATAFRAME EXECUTION PLAN----------
Query List:
1.
SELECT * FROM (SNOWPARK_TEMP_TABLE_Z9H68STVDH)
...
What is the desired behavior?
Ideal behavior would be to clean up the temporary table by defining a __del__ method on snowflake.snowpark.dataframe.DataFrame. I'll admit that I don't have a deep understanding of Snowpark's inner workings, but on the surface it seems like every cached result is essentially a select * from (temp_table_name), which seems generalizable. In pseudo-ish code, it could be something like:
# Module: snowflake.snowpark.dataframe
class DataFrame:
    ...
    def __del__(self):
        if self.is_cached:
            temp_table = ...  # extract the name from self.queries, or some other method
            self._session.sql(f'drop table {temp_table}').collect()
How would this improve snowflake-snowpark-python?
Avoid overcharging on storage credits :)
References, Other Background
Hey @mbkupfer, thanks for your feedback and we think it's a good idea! Since the implementation will be simple, are you willing to open a PR that adds this clean up mechanism? Thank you!
(Probably you don't need to parse queries to find the temp table name, you can create an internal property to store this temp table name when cache_result gets called.)
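A minimal sketch of that suggestion, combined with the __del__ idea above. The attribute _cached_table_name is hypothetical and the table-creation details are elided; this is not Snowpark's actual internals:

# Hedged sketch: cache_result remembers the temp table it created,
# and __del__ drops it. _cached_table_name is a hypothetical attribute.
class DataFrame:
    def cache_result(self):
        temp_table = ...  # generate a unique temp table name
        # ... create the temporary table from this DataFrame's plan ...
        cached = self._session.sql(f'select * from ({temp_table})')
        cached._cached_table_name = temp_table  # remember it for cleanup
        return cached

    def __del__(self):
        table = getattr(self, '_cached_table_name', None)
        if table is not None:
            self._session.sql(f'drop table if exists {table}').collect()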
Sure, would be happy to! And good point on not having to parse anything out.
Hey @sfc-gh-jdu, I noticed a related, but out-of-scope, issue while working on this. Not sure if it's intentional or not, but the temporary table name that gets created is not fully qualified. So if a user switches a session parameter (e.g. schema or database) after a cache_result call, the cached df will throw an 'object does not exist' error.
>>> dff = session.create_dataframe([1,2,3]).cache_result()
>>> session.use_database('different_database')
>>> dff.collect()
SnowparkSQLException: (1304): 002003 (42S02): SQL compilation error:
Object 'SNOWPARK_TEMP_TABLE_I68AKU2J41' does not exist or not authorized.
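Until the name is fully qualified, one workaround is to capture the current context before switching and restore it before using the cached result. A sketch using the existing Session.get_current_database and Session.use_database APIs:

original_db = session.get_current_database()
dff = session.create_dataframe([1, 2, 3]).cache_result()
session.use_database('different_database')
# ... work in the other database ...
session.use_database(original_db)  # restore context so the unqualified name resolves
dff.collect()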
@mbkupfer We're adding an API and context manager to manually drop the cached table. PR https://github.com/snowflakedb/snowpark-python/pull/549 Automatic cleanup will need to explore the underlying query plan data structures, because even if a DataFrame/Table object is destroyed, its query plan may still be referenced by other query plans. We may work on it later.
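Based on the PR description, usage looks roughly like this (a sketch; cache_result now returns a Table):

cached = df.cache_result()  # returns a Table backed by the temp table
cached.collect()
cached.drop_table()         # explicitly drop the underlying temp table

# or let the context manager drop it on exit:
with df.cache_result() as cached:
    cached.collect()
# the temp table is dropped here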
@sfc-gh-yixie the context manager is a nice addition, but more importantly, making cache_result return a Table adds a simpler interface for dropping tables. That said, I don't think this fixes the issue of using a cached result in an interactive session. It would only take care of the drop inside a with block.
For instance,
dff = df.where('....').cache_result()
# and then later on a subsequent update
dff = df.where('....').cache_result()
Wouldn't this still produce two separate tables, unless we explicitly issue a dff.drop_table() call in between?
That's correct. You need to either use drop_table or the with statement for each dff.
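For example, the interactive pattern above could be written as (sketch):

dff = df.where('...').cache_result()
# ... use dff ...
dff.drop_table()                      # drop the first temp table before re-caching
dff = df.where('...').cache_result()  # creates a fresh temp table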
This use case is a good candidate for automatic clean up. But we'll need to do it later.