lance icon indicating copy to clipboard operation
lance copied to clipboard

Filtering on dates when using `to_table`

Open JSpenced opened this issue 2 years ago • 1 comments

I was trying to test filtering and the primary thing I would want to filter for one example dataset is the date. I tried to get the date working and looking through a bunch of pyarrow code, but I couldn't figure it out. Just checking if anyone knows if it is supported and if there is an example somewhere or how to get dates working correctly.

dfl = dataset.to_table(
    columns=["inference_date"], 
    limit=0, 
    # also doesn't work
    # filter=ds.field("inference_date") > pd.to_datetime("2020-01-05')
    filter=ds.field("inference_date") > pa.scalar(datetime.datetime.strptime('2020-01-05', "%Y-%m-%d"))
).to_pandas()

Error is ValueError: LanceError(I/O): sql parser error: Expected ), found: 00.

/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb Cell 12' in <cell line: 1>()
----> [1](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0) dfl = dataset.to_table(
      [2](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=1)     # columns=["inference_date"], 
      [3](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2)     limit=0, 
      [5](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=4)     filter=ds.field("inference_date") > pa.scalar(datetime.datetime.strptime('2020-01-05', "%Y-%m-%d"))
      [6](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=5) ).to_pandas()

File ~/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py:179, in LanceDataset.to_table(self, columns, filter, limit, offset, nearest)
    [138](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=137) def to_table(
    [139](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=138)     self,
    [140](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=139)     columns: Optional[list[str]] = None,
   (...)
    [144](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=143)     nearest: Optional[dict] = None,
    [145](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=144) ) -> pa.Table:
    [146](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=145)     """Read the data into memory as a pyarrow Table.
    [147](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=146) 
    [148](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=147)     Parameters
   (...)
    [177](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=176)     2. The results are filtered afterwards.
    [178](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=177)     """
--> [179](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=178)     return self.scanner(
    [180](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=179)         columns=columns, filter=filter, limit=limit, offset=offset, nearest=nearest
...
    [491](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=490)         self._columns, self._filter, self._limit, self._offset, self._nearest
    [492](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=491)     )
    [493](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=492)     return LanceScanner(scanner, self.ds)

JSpenced avatar Apr 18 '23 14:04 JSpenced

hi @JSpenced , thanks for the bug report. There's two issues here:

  1. Because Lance is Rust based, we don't really have access to the Arrow C++ compute expressions. So instead, we use datafusion under the hood and you can pass in sql where clause strings. I'll file a docs issue to clarify this difference
  2. Currently there's a feature gap for datetime columns where the code does not know to convert strings to datetime when appropriate. I'll file a separate issue for this.

changhiskhan avatar Apr 18 '23 16:04 changhiskhan

datetime string parsing is completed via https://github.com/lancedb/lance/pull/1026

There's some movement on pyarrow to make compute expressions convertible to str / datafusion. And long-term we may want to use substrait anyways

changhiskhan avatar Jul 02 '23 23:07 changhiskhan

the remaining issue with compute expression is tracked in #849 so I'm closing this issue

changhiskhan avatar Jul 02 '23 23:07 changhiskhan