Filtering on dates when using `to_table`
I was trying to test filtering and the primary thing I would want to filter for one example dataset is the date. I tried to get the date working and looking through a bunch of pyarrow code, but I couldn't figure it out. Just checking if anyone knows if it is supported and if there is an example somewhere or how to get dates working correctly.
dfl = dataset.to_table(
columns=["inference_date"],
limit=0,
# also doesn't work
# filter=ds.field("inference_date") > pd.to_datetime("2020-01-05')
filter=ds.field("inference_date") > pa.scalar(datetime.datetime.strptime('2020-01-05', "%Y-%m-%d"))
).to_pandas()
Error is ValueError: LanceError(I/O): sql parser error: Expected ), found: 00.
/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb Cell 12' in <cell line: 1>()
----> [1](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=0) dfl = dataset.to_table(
[2](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=1) # columns=["inference_date"],
[3](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2) limit=0,
[5](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=4) filter=ds.field("inference_date") > pa.scalar(datetime.datetime.strptime('2020-01-05', "%Y-%m-%d"))
[6](vscode-notebook-cell://ssh-remote%2Bwork-dev/home/ubuntu/ml/SETUP_test_lance_datasets.ipynb#X14sdnNjb2RlLXJlbW90ZQ%3D%3D?line=5) ).to_pandas()
File ~/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py:179, in LanceDataset.to_table(self, columns, filter, limit, offset, nearest)
[138](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=137) def to_table(
[139](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=138) self,
[140](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=139) columns: Optional[list[str]] = None,
(...)
[144](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=143) nearest: Optional[dict] = None,
[145](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=144) ) -> pa.Table:
[146](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=145) """Read the data into memory as a pyarrow Table.
[147](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=146)
[148](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=147) Parameters
(...)
[177](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=176) 2. The results are filtered afterwards.
[178](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=177) """
--> [179](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=178) return self.scanner(
[180](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=179) columns=columns, filter=filter, limit=limit, offset=offset, nearest=nearest
...
[491](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=490) self._columns, self._filter, self._limit, self._offset, self._nearest
[492](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=491) )
[493](file:///home/ubuntu/miniconda/envs/opcity-ml/lib/python3.9/site-packages/lance/dataset.py?line=492) return LanceScanner(scanner, self.ds)
hi @JSpenced , thanks for the bug report. There's two issues here:
- Because Lance is Rust based, we don't really have access to the Arrow C++ compute expressions. So instead, we use datafusion under the hood and you can pass in sql where clause strings. I'll file a docs issue to clarify this difference
- Currently there's a feature gap for datetime columns where the code does not know to convert strings to datetime when appropriate. I'll file a separate issue for this.
datetime string parsing is completed via https://github.com/lancedb/lance/pull/1026
There's some movement on pyarrow to make compute expressions convertible to str / datafusion. And long-term we may want to use substrait anyways
the remaining issue with compute expression is tracked in #849 so I'm closing this issue