SNOW-654710: Drop cached temporary tables that are no longer being referenced
What is the current behavior?
cache_result does not clean up a temporary table once it is no longer referenced. This incurs unnecessary storage credits for the rest of the session whenever a result is re-cached, which is a common thing to do in an interactive session.
>>> dff = session.create_dataframe([1,2,3])
>>> dff.cache_result().explain()
---------DATAFRAME EXECUTION PLAN----------
Query List:
1.
SELECT * FROM (SNOWPARK_TEMP_TABLE_X3FCJ1U38A)
...
--------------------------------------------
>>> dff.cache_result().explain()
---------DATAFRAME EXECUTION PLAN----------
Query List:
1.
SELECT * FROM (SNOWPARK_TEMP_TABLE_Z9H68STVDH)
...
What is the desired behavior?
Ideal behavior would be to clean up the temporary table by defining a __del__ method on snowflake.snowpark.dataframe.DataFrame. I'll admit that I don't have a deep understanding of Snowpark's inner workings, but on the surface it seems like every cached result is essentially a select * from (temp_table_name), which seems generalizable. In pseudo-ish code, it could be something like:
# Module: snowflake.snowpark.dataframe
class DataFrame:
    ...
    def __del__(self):
        if self.is_cached:
            temp_table = ...  # extract the name from self.queries, or some other method
            self._session.sql(f'drop table {temp_table}').collect()
How would this improve snowflake-snowpark-python?
Avoid overcharging on storage credits :)
References, Other Background
Hey @mbkupfer, thanks for your feedback and we think it's a good idea! Since the implementation will be simple, are you willing to open a PR that adds this clean up mechanism? Thank you!
(Probably you don't need to parse queries to find the temp table name, you can create an internal property to store this temp table name when cache_result gets called.)
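A minimal sketch of that suggestion, combined with the __del__ idea above. The attribute _cached_table_name is hypothetical and the table-creation details are elided; this is not Snowpark's actual internals:

# Hedged sketch: cache_result remembers the temp table it created,
# and __del__ drops it. _cached_table_name is a hypothetical attribute.
class DataFrame:
    def cache_result(self):
        temp_table = ...  # generate a unique temp table name
        # ... create the temporary table from this DataFrame's plan ...
        cached = self._session.sql(f'select * from ({temp_table})')
        cached._cached_table_name = temp_table  # remember it for cleanup
        return cached

    def __del__(self):
        table = getattr(self, '_cached_table_name', None)
        if table is not None:
            self._session.sql(f'drop table if exists {table}').collect()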
Sure, would be happy to! And good point on not having to parse anything out.
Hey @sfc-gh-jdu, I noticed a related, but out-of-scope, issue while working on this. Not sure if it's intentional or not, but the temporary table name that gets created is not fully qualified. So if a user switches a session parameter (e.g. schema or database) after a cache_result call, the cached df will throw an 'object does not exist' error.
>>> dff = session.create_dataframe([1,2,3]).cache_result()
>>> session.use_database('different_database')
>>> dff.collect()
SnowparkSQLException: (1304): 002003 (42S02): SQL compilation error:
Object 'SNOWPARK_TEMP_TABLE_I68AKU2J41' does not exist or not authorized.
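Until the name is fully qualified, one workaround is to capture the current context before switching and restore it before using the cached result. A sketch using the existing Session.get_current_database and Session.use_database APIs:

original_db = session.get_current_database()
dff = session.create_dataframe([1, 2, 3]).cache_result()
session.use_database('different_database')
# ... work in the other database ...
session.use_database(original_db)  # restore context so the unqualified name resolves
dff.collect()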
@mbkupfer We're adding an API and context manager to manually drop the cached table. PR https://github.com/snowflakedb/snowpark-python/pull/549 Automatic cleanup will need to explore the underlying query plan data structures, because even if a DataFrame/Table object is destroyed, its query plan may still be referenced by other query plans. We may work on it later.
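Based on the PR description, usage looks roughly like this (a sketch; cache_result now returns a Table):

cached = df.cache_result()  # returns a Table backed by the temp table
cached.collect()
cached.drop_table()         # explicitly drop the underlying temp table

# or let the context manager drop it on exit:
with df.cache_result() as cached:
    cached.collect()
# the temp table is dropped here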
@sfc-gh-yixie the context manager is a nice addition, but more importantly, making cache_result return a Table adds a simpler interface for dropping tables. That said, I don't think this fixes the issue of using a cached result in an interactive session. It would only take care of the drop inside a with block.
For instance,
dff = df.where('....').cache_result()
# and then later on a subsequent update
dff = df.where('....').cache_result()
Wouldn't this still produce two separate tables, unless we explicitly issue a dff.drop_table() call in between?
That's correct. You need to either use drop_table or the with statement for each dff.
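For example, the interactive pattern above could be written as (sketch):

dff = df.where('...').cache_result()
# ... use dff ...
dff.drop_table()                      # drop the first temp table before re-caching
dff = df.where('...').cache_result()  # creates a fresh temp table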
This use case is a good candidate for automatic clean up. But we'll need to do it later.