lance
_rowaddr and _rowid not exposed for `merge_insert`
Sort of a follow-up on #3251: I noticed that _rowid and _rowaddr don't seem to be usable with merge_insert, although they work with merge. When I try to use them for a sub-column update, something like
import lance
import polars as pl
import pyarrow as pa

initial_data = pa.table(
    {
        "a": range(10),
        "b": range(10),
        "c": range(10, 20),
    }
)
dataset = lance.write_dataset(initial_data, "/tmp/lance/test2.lance")

# Keep only _rowid and the updated column "a"; "b" and "c" are left out on purpose
new_values = pl.from_arrow(dataset.to_table(with_row_id=True)).select(
    pl.col("_rowid"), pl.col("a") * 2
)
dataset.merge_insert("a").when_matched_update_all().execute(new_values)
gives me
OSError: Append with different schema: fields did not match, missing=[b, c], unexpected=[_rowid], location: /Users/runner/work/lance/lance/rust/lance-core/src/datatypes/schema.rs:142:27
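For anyone reading the error above, here is a rough pure-Python sketch of how the `missing` and `unexpected` lists can be read. It is only an illustration and not Lance's actual schema check (which lives in the Rust code referenced in the traceback); the helper name `schema_diff` is made up for this example.

```python
def schema_diff(expected_fields, provided_fields):
    """Return (missing, unexpected) field names, preserving input order.

    Hypothetical helper for illustration only; not part of the Lance API.
    """
    expected = set(expected_fields)
    provided = set(provided_fields)
    # Fields the dataset has but the source table does not supply
    missing = [name for name in expected_fields if name not in provided]
    # Fields the source table supplies but the dataset does not have
    unexpected = [name for name in provided_fields if name not in expected]
    return missing, unexpected


# Dataset columns from the repro above
dataset_fields = ["a", "b", "c"]
# Columns of new_values: only _rowid and the updated "a"
source_fields = ["_rowid", "a"]

missing, unexpected = schema_diff(dataset_fields, source_fields)
print(f"missing={missing}, unexpected={unexpected}")
```

Under that reading, the source table is compared against the full dataset schema, so the partial column set shows up as missing fields and `_rowid` shows up as an unexpected one.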
I think this is quite different from #3251. Because _rowid is managed by Lance, we cannot insert _rowid into Lance.
If I’m not mistaken, this doesn’t have anything to do with merge_insert, does it? You just want to update (a) specific column(s), right?
https://github.com/lance-format/lance/pull/4715 already covers this for Fragments. @wjones127, should we expose an API at the dataset level? 🤔