lance icon indicating copy to clipboard operation
lance copied to clipboard

Generated Columns

Open chenkovsky opened this issue 1 year ago • 5 comments

We are using lance to store text corpus, but before training, some lightweight normalizations should be applied to the text, for example, removing sensitive words. currently, we have to store both normalized text and unnormalized text. the storage is doubled.

maybe lance can implement generated column which is similar to https://www.sqlite.org/gencol.html.

the benifits are:

  • save storage size
  • better IO performance
  • no need to update normalized text any more.

chenkovsky avatar Jan 11 '25 06:01 chenkovsky

@westonpace FYI

LuQQiu avatar Jan 14 '25 02:01 LuQQiu

another scenario is that, we have multi model corpus for trainning. but videos are too large, and we want to share videos or images between different datasets. so videos or large images are stored solely, we only store url in lance. with generated columns, user can get videos or images as if stored in lance. they don't need to care about where are images or videos.

sometimes images are also extracted from videos. with this feature. we can extract image on the fly.

chenkovsky avatar Jan 14 '25 03:01 chenkovsky

This seems reasonable. We'd need to track the function itself as well as which columns are source columns for the function. Most of the work would end up in the scanner keeping track of the various schema (but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue)

Today, lightweight functions can be applied as a projection when you query the data:

>>> import pyarrow as pa
>>> tab = pa.table({"a": [-3, -2, -1, 0, 1, 2, 3]})
>>> import lance
>>> ds = lance.write_dataset(tab, "/tmp/foo.lance")
>>> ds.to_table(columns={"a_sq": "a*a"})
pyarrow.Table
a_sq: int64
----
a_sq: [[9,4,1,0,1,4,9]]

another scenario is that, we have multi model corpus for trainning. but videos are too large, and we want to share videos or images between different datasets. so videos or large images are stored solely, we only store url in lance. with generated columns, user can get videos or images as if stored in lance. they don't need to care about where are images or videos.

sometimes images are also extracted from videos. with this feature. we can extract image on the fly.

This is might be doable, but for more expensive tasks it might make more sense to either do this as a dedicated feature and / or do this later in the pipeline (e.g. in a python iterator). Though for video -> image you could maybe get away with a datafusion UDF (I feel like memory might get a bit tricky but maybe solved with batch sizes).

westonpace avatar Jan 14 '25 14:01 westonpace

yes, we can use

ds.to_table(columns={"a_sq": "a*a"})

the only difference is that, data producer can be decoupled with data consumer. otherwise when sql logic is changed, data producer has to notify every user.

chenkovsky avatar Jan 15 '25 01:01 chenkovsky

(but I believe you are also the one that suggested we introduce metadata columns? So maybe some of that schema refactor is overdue)

after my pr for datafusion is merged, I can help to refactor the schema.

chenkovsky avatar Jan 15 '25 02:01 chenkovsky