
Datafusion fails to read from LanceDataset

Open dreverri opened this issue 1 year ago • 4 comments

I'm getting the following error when trying to read a LanceDB table with Datafusion:

[2024-12-20T19:10:11Z WARN  lance::dataset::write::insert] No existing dataset at /lance-dataset/data/sample-lancedb/my_table.lance, it will be created
Traceback (most recent call last):
  File "/lance-dataset/hello.py", line 23, in <module>
    main()
    ~~~~^^
  File "/lance-dataset/hello.py", line 19, in main
    df.show()
    ~~~~~~~^^
  File "/lance-dataset/.venv/lib/python3.13/site-packages/datafusion/dataframe.py", line 360, in show
    self.df.show(num)
    ~~~~~~~~~~~~^^^^^
Exception: External error: TypeError: LanceFragment.scanner() takes 1 positional argument but 2 positional arguments (and 3 keyword-only arguments) were given

I'm not sure if this is an issue with LanceDataset or Datafusion or if I am just doing something wrong.

Here is the code:

from datafusion import SessionContext
import lancedb


def main():
    uri = "data/sample-lancedb"
    db = lancedb.connect(uri)

    data = [
        {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
        {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
    ]

    tbl = db.create_table("my_table", data=data, mode="overwrite")

    ctx = SessionContext()
    ctx.register_dataset("my_table", tbl.to_lance())
    df = ctx.table("my_table")
    df.show()


if __name__ == "__main__":
    main()

dreverri avatar Dec 20 '24 19:12 dreverri

Looks like you're using DataFusion's pyarrow integration to read from a pyarrow dataset. Lance mimics a pyarrow dataset, which is how DuckDB is able to query it. However, it seems that we don't mimic it faithfully enough :smile: and so DataFusion is getting confused.

I seem to recall digging into this a while back: DataFusion wants to split the dataset into fragments and query it that way, and we never fully fleshed out the pyarrow fragment integration.
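The mismatch behind the TypeError in the traceback can be reproduced without either library. The sketch below is only an illustration: both classes are hypothetical stand-ins (not the real pyarrow or Lance classes), and the call shape (one positional schema plus the keyword arguments `columns`, `filter`, and `batch_size`) is an assumption inferred from the argument counts in the error message.

```python
class PyArrowStyleFragment:
    """Hypothetical fragment whose scanner() accepts the schema positionally."""

    def scanner(self, schema=None, *, columns=None, filter=None, batch_size=None):
        return ("scan", schema, columns, filter, batch_size)


class KeywordOnlyFragment:
    """Hypothetical fragment whose scanner() accepts keyword arguments only."""

    def scanner(self, *, columns=None, filter=None, batch_size=None):
        return ("scan", columns, filter, batch_size)


def scan_like_consumer(fragment):
    # Assumed call shape: one positional argument and three keyword arguments,
    # matching the counts reported in the error message.
    return fragment.scanner("schema", columns=["item"], filter=None, batch_size=1024)


# The pyarrow-style signature accepts the call.
ok = scan_like_consumer(PyArrowStyleFragment())

# The keyword-only signature rejects the positional schema with a TypeError.
try:
    scan_like_consumer(KeywordOnlyFragment())
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

In other words, the consumer passes the schema positionally, so any fragment class that only accepts keyword arguments fails before a single batch is read.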

So there are two options for fixing this. First, we could fix up the Python interface to mimic a pyarrow dataset more faithfully. However, pyarrow dataset wasn't really intended to be a standard / protocol, and there are a few limitations with this approach:

  • You won't get the proper parallelism on reads
  • Filters are not pushed down (or maybe they are but only a limited subset are supported)
  • Some Python overhead (I'm not sure whether it is per-batch, but if it is, that could be significant for some queries)

A different approach (now that https://github.com/apache/datafusion-python/issues/823 has merged) would be to do something like this: https://github.com/delta-io/delta-rs/pull/3012/files

That would be limited to newer versions of datafusion python (43.1 and above) but would overcome the above drawbacks and be easier to maintain.
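As a rough sketch of what the second approach means on the Python side: the FFI route relies on the object exposing a `__datafusion_table_provider__` method, which in a real binding returns a PyCapsule built in Rust. The classes and the `read_path` helper below are hypothetical and only illustrate the capability check; they are not DataFusion's or Lance's actual code.

```python
FFI_HOOK = "__datafusion_table_provider__"


class FfiBackedTable:
    """Hypothetical table whose native side would export a table provider."""

    def __datafusion_table_provider__(self):
        # A real binding builds and returns a PyCapsule on the Rust side.
        raise NotImplementedError("provided by the native extension in practice")


class PyArrowStyleDataset:
    """Hypothetical object that only mimics the pyarrow dataset surface."""

    def get_fragments(self):
        return []


def read_path(obj):
    # Illustrative dispatch: prefer the FFI hook when the object offers it.
    if callable(getattr(obj, FFI_HOOK, None)):
        return "ffi_table_provider"
    return "pyarrow_dataset"


print(read_path(FfiBackedTable()))
print(read_path(PyArrowStyleDataset()))
```

With the hook in place, the scan is planned and executed natively rather than through Python-level fragment calls, which is why this route avoids the drawbacks listed above.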

westonpace avatar Dec 20 '24 20:12 westonpace

(to be clear, both approaches will require changes to Lance)

westonpace avatar Dec 20 '24 20:12 westonpace

Hi @westonpace, I would like to work on this. Could you please provide some pointers on where would be a good place to keep the FFI code for Lance? Maybe rust/lance-datafusion/src/exec.rs?

renato2099 avatar Feb 23 '25 18:02 renato2099

@renato2099 I think it will need to be in rust/lance. We introduce Dataset in rust/lance and you'll probably want to build the datafusion provider on top of that. The stuff in rust/lance-datafusion is more helper functions for us to run datafusion plans.

Also, it's probably worth noting that we have a TableProvider implementation in both lance and lancedb. They are very similar and should have the same capabilities. I think we will eventually want FFI table providers for both. Starting with lance is fine.

In lance the TableProvider is LanceTableProvider and it is located in rust/lance/src/datafusion/dataframe.rs. You could put the FFI code in there or you could put it in rust/lance/src/datafusion/ffi.rs. I don't have any strong preference.

This will be pretty cool to see come together 😄

westonpace avatar Feb 24 '25 14:02 westonpace

I think this can be closed now @westonpace :)

renato2099 avatar May 20 '25 11:05 renato2099