Fast random access for `StreamingDataset`

Open ethanwharris opened this issue 1 year ago • 3 comments

🚀 Feature

Support a way to request just a single sample from a StreamingDataset without internally pulling the whole chunk.

Motivation

Streaming chunks is great when you want to visit the whole dataset, but sub-optimal if you just want to view individual samples. Right now, if you index a StreamingDataset directly, the latency is very high. This is a problem if you want to explore the dataset interactively (e.g. in a Streamlit or Gradio app).

Pitch

We could have a way to request a single sample from the dataset that would download only the bytes of that sample instead of downloading the whole chunk. This would enable building visualizations etc. on top of streaming datasets.
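As a rough illustration of the idea, the sketch below assumes a hypothetical chunk layout (a uint32 item count, then a table of uint32 byte offsets, then the concatenated sample bytes); this is not necessarily how litdata lays out its chunks. `build_chunk`, `make_reader`, `fetch_sample` and `read_range` are made-up names: `read_range` stands in for a ranged read against remote storage (e.g. an HTTP `Range` request) and here simply slices an in-memory buffer.

```python
import struct


def build_chunk(samples):
    """Pack samples into a hypothetical chunk: count, offset table, payload."""
    header_size = 4 + 4 * (len(samples) + 1)
    offsets = [header_size]
    for s in samples:
        offsets.append(offsets[-1] + len(s))
    header = struct.pack("<I", len(samples))
    header += struct.pack(f"<{len(offsets)}I", *offsets)
    return header + b"".join(samples)


def make_reader(blob):
    """Return a read_range(start, end) function plus a log of bytes fetched."""
    log = []

    def read_range(start, end):
        # Stand-in for a ranged remote read; `end` is exclusive.
        log.append(end - start)
        return blob[start:end]

    return read_range, log


def fetch_sample(read_range, index):
    """Fetch one sample with two small ranged reads instead of the whole chunk."""
    # First read: the two offsets that bracket sample `index`.
    table_pos = 4 + 4 * index
    begin, end = struct.unpack("<2I", read_range(table_pos, table_pos + 8))
    # Second read: only the bytes of that sample.
    return read_range(begin, end)


chunk = build_chunk([b"alpha", b"bravo-bravo", b"charlie"])
read_range, fetched = make_reader(chunk)
sample = fetch_sample(read_range, 1)
print(sample, sum(fetched), "of", len(chunk), "bytes fetched")
```

With real chunks the saving would depend on sample size relative to chunk size, and compressed or encrypted chunks would not allow this kind of direct slicing.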

Alternatives

Additional context

ethanwharris avatar Feb 23 '24 09:02 ethanwharris

Hey @ethanwharris, we have a feature to subsample the dataset, although the subsamples are chosen to come from as few chunks as possible. Indexing and slicing are also supported.

These don't exactly fulfill your requirements, but I believe they go a long way toward addressing them.

from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.01) # data are stored in the cloud

print(len(dataset)) # display the length of your data
# out: 1000

Without unpacking the bin file, it might be challenging to fetch an exact item, and encrypted chunks make this harder still. Please let us know if this is what you would like to have.

Otherwise, feel free to close the issue.

deependujha avatar Jul 25 '24 19:07 deependujha

The goal here was to add support for multi-range fetching from the client side, so that we don't fetch the entire binary file but only what the user requests.
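A minimal sketch of what the client-side part of multi-range fetching could look like, assuming the byte ranges of the requested samples are already known. `coalesce`, `range_header` and `max_gap` are hypothetical names, not litdata APIs. Nearby ranges are merged so fewer requests are needed, then formatted as an HTTP `Range` header value, which uses inclusive end positions.

```python
def coalesce(ranges, max_gap=0):
    """Merge half-open (start, end) byte ranges that overlap or lie within max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def range_header(ranges):
    """Format half-open ranges as an HTTP Range header value (ends are inclusive)."""
    return "bytes=" + ", ".join(f"{start}-{end - 1}" for start, end in ranges)


# Byte ranges of the samples the user asked for (illustrative values).
wanted = [(120, 180), (20, 50), (50, 90), (400, 450)]
merged = coalesce(wanted, max_gap=32)
header = range_header(merged)
print(merged, header)
```

Some object stores accept only a single range per request, in which case each merged range would be sent as its own request rather than combined in one header.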

tchaton avatar Jul 26 '24 10:07 tchaton

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]