databricks-sql-python

Allow ingesting in-memory file-like objects

Open dhirschfeld opened this issue 1 year ago • 4 comments

Writing large amounts of data to disk, only for databricks-sql-connector to then read it back in from disk, is incredibly inefficient.

It would be much more efficient to be able to provide a file-like object to use instead of a filepath. In that way a user could write the data to an in-memory io.BytesIO object instead of writing the data to disk.
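To illustrate the in-memory side of this, here is a minimal sketch that serializes rows to CSV directly into an `io.BytesIO` buffer, with no temporary file involved. The helper name `rows_to_csv_buffer` and the sample rows are hypothetical and only for illustration; nothing here is part of the connector.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize rows to CSV in memory and return a binary buffer rewound to the start."""
    buffer = io.BytesIO()
    # csv.writer needs a text stream, so wrap the binary buffer temporarily.
    # newline="" leaves the csv module's own line endings untouched.
    text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
    csv.writer(text).writerows(rows)
    # detach() flushes the wrapper and releases the buffer without closing it.
    text.detach()
    buffer.seek(0)
    return buffer


# Hypothetical sample data; the buffer can be read like any binary file.
buf = rows_to_csv_buffer([("id", "name"), (1, "alice"), (2, "bob")])
```

The resulting `buf` exposes the same `read()` interface as a file opened with `open(path, "rb")`, which is what would make it a drop-in substitute for a file on disk.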

dhirschfeld Sep 02 '24 08:09

i.e. allow passing through fh rather than creating it internally by opening a file from the filesystem: https://github.com/databricks/databricks-sql-python/blob/d31063ca918167412153a368c13a99055bf89c02/src/databricks/sql/client.py#L656-L668
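A rough sketch of what "passing through `fh`" could look like, assuming a hypothetical helper named `open_binary_source` (this is not the connector's actual code or API): if the caller supplies a path, the helper opens and closes the file itself; if the caller supplies a file-like object, it is handed through untouched and left open.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def open_binary_source(source):
    """Yield a readable binary handle for either a filesystem path or a file-like object."""
    if isinstance(source, (str, bytes, os.PathLike)):
        # Path given: open the file here and close it when the block exits.
        with open(source, "rb") as fh:
            yield fh
    elif hasattr(source, "read"):
        # File-like object given: pass it through; the caller owns its lifetime.
        yield source
    else:
        raise TypeError(f"expected a path or file-like object, got {type(source).__name__}")


def payload_size(source):
    """Hypothetical consumer: read the whole payload and report its size in bytes."""
    with open_binary_source(source) as fh:
        return len(fh.read())
```

With this shape, `payload_size("data.csv")` and `payload_size(io.BytesIO(b"..."))` follow the same code path after the handle is obtained, so the upload logic itself would not need to know where the bytes came from.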

dhirschfeld Sep 02 '24 11:09

Hi @dhirschfeld! This does sound like an interesting feature, thank you for sharing it! I have to talk with the rest of the team first. Databricks SQL GET and PUT commands are expected to specify a local file path, but I don't know if we ever considered using streams instead of real files. If we agree that there are no risks with this approach, we would eventually have to implement it across all drivers.

kravets-levko Sep 04 '24 17:09

Some added context: @dhirschfeld's idea is exactly how the e2e tests for this feature behave (since we ran them in GitHub Actions, where we don't have a real file system to write to). It should be a straightforward modification.

susodapop Sep 21 '24 00:09

This is an interesting ask. We have implemented a very similar solution in another SQL driver, so we should be able to port it here.

gopalldb Feb 27 '25 04:02

+1 to OP's ask. This would be a huge win.

jeffreyh1003 Apr 25 '25 01:04