Implement a cleanup step for cached DataFrames
Add an explicit SQL `DROP` command that removes the underlying temporary table created for cached DataFrames.
Please answer these questions before submitting your pull requests. Thanks!
- What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

  Fixes #463
- Fill out the following pre-review checklist:
  - [x] I am adding a new automated test(s) to verify correctness of my new code
  - [ ] I am adding new logging messages
  - [ ] I am adding a new telemetry message
  - [ ] I am adding new credentials
  - [ ] I am adding a new dependency
- Please describe how your code solves the related issue.

  Add a `__del__` finalizer method that drops the temporary table that gets created for cached results.
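  In rough terms, the finalizer looks like the minimal sketch below; the attribute names `_cached_table_name` and `_session` are illustrative, not the actual field names in the diff.

  ```python
  class DataFrame:
      ...

      def __del__(self):
          # Illustrative attribute: set only when cache_result() created
          # a backing temp table for this DataFrame.
          table = getattr(self, "_cached_table_name", None)
          if table is None:
              return  # not a cached DataFrame; nothing to drop
          try:
              # Best-effort cleanup; a finalizer should never raise.
              self._session.sql(f"DROP TABLE IF EXISTS {table}").collect()
          except Exception:
              pass
  ```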
Hi @mbkupfer, thank you for putting out the PR. Agree that it's important to clean up cached tables. We'll discuss the solution in the next week and will keep you updated. Our questions are:
- Will the clean-up cause any problems after a DataFrame is destructed? The underlying query plan may be referenced by other DataFrames.
- Connecting to a network resource from `__del__` may impact the Python garbage collection.
- Aside from auto-cleanup, do we give users an API to clean up whenever they want? Do you think this API will be helpful?
@sfc-gh-yixie those are some valid points. Gave it some thought and here are my replies:
> Will the clean-up cause any problems after a DataFrame is destructed? The underlying query plan may be referenced by other DataFrames.
Yes, this is indeed problematic. IMHO I'd prefer to get an error rather than have the program quietly use a stale table.
Moreover, this is actually systemic to the overall package and not specific to cached results. For instance, there are no such "safeguards" even when end users don't use `cache_result`, as in the case below:
```python
import snowflake.snowpark.functions as F
# `session` is an existing snowflake.snowpark.Session

df = session.create_dataframe([1, 2], schema=['col'])
df2 = df.where(F.col('col') > 1)
# Rebinding `df` drops the only direct reference to the original
# DataFrame, but df2 still depends on its query plan.
df = session.create_dataframe([3, 4, 5], schema=['col'])
df2.show()
```
Perhaps a way around this would be to add extra clean-up steps to the `__del__` method that check everything still referencing the deleted DataFrame's plan, regardless of whether it was cached, and ensure those DataFrames get cleaned up as well (rough sketch below). It's a bit clunky, but it keeps users out of that pitfall. What do you think?
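For illustration, one crude way to detect that a plan is still shared before dropping anything; `_plan` is a hypothetical attribute name, the check is CPython-specific, and it is blind to indirect references through composed plans:

```python
import sys

def _plan_is_shared(df) -> bool:
    # sys.getrefcount sees the temporary argument reference plus this
    # DataFrame's own attribute reference, hence the threshold of 2;
    # anything above that means some other object also holds the plan.
    return sys.getrefcount(df._plan) > 2
```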
Also want to note that this is out of scope for the temp table cleanup issue.
> Connecting to a network resource from `__del__` may impact the Python garbage collection.
True. I guess we can just call it a best-attempt convenience feature with no guarantees. 😏
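For what it's worth, one way to soften the `__del__` concern, and plainly not what this PR currently does, would be to register the cleanup with `weakref.finalize` instead, which is better behaved at interpreter shutdown:

```python
import weakref

def _drop_temp_table(session, table_name):
    # Module-level callback so it holds no reference back to the DataFrame.
    try:
        session.sql(f"DROP TABLE IF EXISTS {table_name}").collect()
    except Exception:
        pass  # best-attempt only

# At cache_result() time (sketch; names are illustrative):
# weakref.finalize(cached_df, _drop_temp_table, session, temp_table_name)
# Note: finalize keeps strong references to its arguments, so the
# session stays alive until the callback runs.
```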
> Aside from auto-cleanup, do we give users an API to clean up whenever they want? Do you think this API will be helpful?
I appreciate you seeking my opinion on this, and of course I defer the final decision to your team, but I don't think this is really necessary. I'd rather have the library do any cleanup intelligently on its own than have users blindly send unnecessary clean-up commands. If it ships with documented use cases and best practices, then sure, but simply handing over the option may introduce confusion.