lance Generated Columns

We are using lance to store text corpus, but before training, some lightweight normalizations should be applied to the text, for example, removing sensitive words. currently, we have to store both normalized text and unnormalized text. the storage is doubled.

maybe lance can implement generated column which is similar to https://www.sqlite.org/gencol.html.

the benifits are:

save storage size
better IO performance
no need to update normalized text any more.

Jan 11 '25 06:01 chenkovsky

@westonpace FYI

Jan 14 '25 02:01 LuQQiu

another scenario is that, we have multi model corpus for trainning. but videos are too large, and we want to share videos or images between different datasets. so videos or large images are stored solely, we only store url in lance. with generated columns, user can get videos or images as if stored in lance. they don't need to care about where are images or videos.

sometimes images are also extracted from videos. with this feature. we can extract image on the fly.

Jan 14 '25 03:01 chenkovsky

This seems reasonable. We'd need to track the function itself as well as which columns are source columns for the function. Most of the work would end up in the scanner keeping track of the various schema (but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue)

Today, lightweight functions can be applied as a projection when you query the data:

>>> import pyarrow as pa
>>> tab = pa.table({"a": [-3, -2, -1, 0, 1, 2, 3]})
>>> import lance
>>> ds = lance.write_dataset(tab, "/tmp/foo.lance")
>>> ds.to_table(columns={"a_sq": "a*a"})
pyarrow.Table
a_sq: int64
----
a_sq: [[9,4,1,0,1,4,9]]

another scenario is that, we have multi model corpus for trainning. but videos are too large, and we want to share videos or images between different datasets. so videos or large images are stored solely, we only store url in lance. with generated columns, user can get videos or images as if stored in lance. they don't need to care about where are images or videos.

sometimes images are also extracted from videos. with this feature. we can extract image on the fly.

This is might be doable, but for more expensive tasks it might make more sense to either do this as a dedicated feature and / or do this later in the pipeline (e.g. in a python iterator). Though for video -> image you could maybe get away with a datafusion UDF (I feel like memory might get a bit tricky but maybe solved with batch sizes).

Jan 14 '25 14:01 westonpace

yes, we can use

ds.to_table(columns={"a_sq": "a*a"})

the only difference is that, data producer can be decoupled with data consumer. otherwise when sql logic is changed, data producer has to notify every user.

Jan 15 '25 01:01 chenkovsky

(but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue)

after my pr for datafusion is merged, I can help to refactor the schema.

Jan 15 '25 02:01 chenkovsky