
Metadata for each column

Open parsa-ra opened this issue 2 years ago • 5 comments

Feature request

The ability to attach metadata to each column, as a string or any other type.

Motivation

To motivate this with an example: suppose we are experimenting with embeddings produced by an image encoder network and want to try several preprocessing pipelines to see which works best on our downstream task. My current workaround is to compute a hash of the preprocessing the images went through and include it in the new column's name. It would be nice to be able to attach metadata to each column in scenarios like this.
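
For illustration, the column-naming workaround described above could look roughly like the sketch below. The helper name `column_name_for`, the example preprocessing settings, and the choice of a SHA-256 digest truncated to 8 characters are all assumptions made for this example, not details from the issue.

```python
import hashlib
import json


def column_name_for(base, preprocessing):
    """Build a column name that encodes a hash of the preprocessing settings.

    Hypothetical helper: `preprocessing` is assumed to be a JSON-serializable dict.
    """
    # Sort keys so that the same settings always produce the same hash.
    payload = json.dumps(preprocessing, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:8]
    return f"{base}_{digest}"


# Two made-up preprocessing configurations yield two distinct column names.
config_a = {"resize": 224, "normalize": True}
config_b = {"resize": 256, "normalize": True}
name_a = column_name_for("embedding", config_a)
name_b = column_name_for("embedding", config_b)
```

The drawback is that the hash alone says nothing about what the preprocessing actually was, which is why per-column metadata would be a better fit.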

Your contribution

Maybe we could map a separate relational-style database to serve as the metadata?

parsa-ra avatar Feb 24 '23 10:02 parsa-ra

Hi! Indeed it would be useful to support this. PyArrow natively supports schema-level and column-level metadata, so implementing this should be straightforward. The API I have in mind would work as follows:

col_feature = Value("string", metadata="Some column-level metadata")

features = Features({"col": col_feature}, metadata="Some schema-level metadata")

WDYT?

mariosasko avatar Feb 27 '23 14:02 mariosasko

Sorry for the late reply. Yes, I think this is the most straightforward approach given what we already have.

parsa-ra avatar Mar 05 '23 11:03 parsa-ra

@mariosasko Let me know how I can help.

parsa-ra avatar Mar 10 '23 17:03 parsa-ra

Hi, will this feature be implemented in the near future? It would be really nice if so!

mmlynarik avatar Dec 04 '23 20:12 mmlynarik

Hi, I also need this feature, to tell my customer whether any of the features is encrypted with a certain key.

felixgao avatar Jan 05 '24 21:01 felixgao