Integrating MDS Streaming with HF Dataset Streaming

Open siddk opened this issue 1 year ago • 8 comments

🚀 Feature Request

Hey folks - I've loved using streaming for some of my research in multimodal pretraining and robotics. One thing I'd love to support is first-class integration with HF Datasets (e.g., similar functionality to their WebDataset Streaming Integration).

I've created an issue on HF Datasets here, and @lhoestq seems receptive to the idea. At a low level, I'm not sure about the best way to implement this support. Would love pointers or a chance to talk this through!

Motivation

Mosaic Streaming from MDS is fantastic for large-scale, reproducible pretraining! For some of my larger datasets, supporting the ability to stream MDS shards stored on HF Datasets while training would be fantastic.

Thanks!

siddk avatar Mar 19 '24 14:03 siddk

Hey, this would be great! What did you have in mind regarding the implementation -- what should be done on Streaming's side?

snarayan21 avatar Mar 21 '24 15:03 snarayan21

It would be nice to stream datasets from HF using Streaming, e.g. supporting hf:// paths

lhoestq avatar Mar 21 '24 16:03 lhoestq

@lhoestq Would it be possible for the user to upload the MDS shard files to hf:// paths? Or is your ask to support the HF remote path with whatever underlying files it can contain, such as Parquet, JSONL, etc.?

karan6181 avatar Apr 02 '24 15:04 karan6181

At HF we want to make the Hub more open and support more data formats and libraries. We recently added support for WebDataset for example, and there are hundreds of datasets in WebDataset format on the HF Hub already.

Users can already upload data files in MDS format that they have locally using e.g. huggingface_hub. Maybe one day with the MDSWriter directly? That would be cool!
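
A minimal sketch of that upload path, assuming the local directory is an MDS dataset (an `index.json` plus shard files) and using `HfApi.create_repo` and `HfApi.upload_folder` from `huggingface_hub`. The helper names and any repo id you pass in are hypothetical, and the Hub import is done lazily so the local check can run without network access or credentials:

```python
from pathlib import Path


def list_mds_files(local_dir: str) -> list[str]:
    """Return relative paths of the files in an MDS directory, index first.

    Raises FileNotFoundError if there is no top-level index.json.
    """
    root = Path(local_dir)
    if not (root / "index.json").is_file():
        raise FileNotFoundError(f"no index.json under {root}")
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    # Move the top-level index to the front purely for readability.
    files.remove("index.json")
    return ["index.json", *files]


def upload_mds_dir(local_dir: str, repo_id: str) -> None:
    """Upload a local MDS directory to a Hub dataset repo (needs network + auth)."""
    from huggingface_hub import HfApi  # lazy import: third-party dependency

    list_mds_files(local_dir)  # fail early if this is not an MDS directory
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="dataset")
```

Calling `upload_mds_dir("./my_mds_out", "my-user/my-mds-dataset")` (both values are placeholders) would push the whole directory in one commit.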

Anyway, what I think is most interesting is if Streaming could stream datasets in MDS format from HF (e.g. using hf:// paths). That would be useful to many researchers IMO.

lhoestq avatar Apr 04 '24 11:04 lhoestq

Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system

From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?

Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports.
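
To make the "drop-in replacement" question concrete, here is a small sketch of how the two kinds of remote string break down. It assumes the `hf://<repo_type>/<user>/<repo>/<path>` layout that matches the dataset paths used in this thread; repos without a user namespace and revision specifiers are not handled, and `split_remote` is a hypothetical helper, not part of Streaming or `huggingface_hub`:

```python
from urllib.parse import urlparse


def split_remote(remote: str) -> tuple[str, str, str]:
    """Split a remote URI into (scheme, container, key prefix).

    For s3://bucket/prefix the container is the bucket. For
    hf://datasets/user/repo/prefix it is assumed to be "datasets/user/repo".
    """
    parsed = urlparse(remote)
    parts = [p for p in (parsed.netloc + parsed.path).split("/") if p]
    if parsed.scheme == "hf":
        if len(parts) < 3:
            raise ValueError(
                f"expected hf://<repo_type>/<user>/<repo>[/path], got {remote!r}"
            )
        container, rest = "/".join(parts[:3]), parts[3:]
    else:
        if not parts:
            raise ValueError(f"no bucket or container in {remote!r}")
        container, rest = parts[0], parts[1:]
    return parsed.scheme, container, "/".join(rest)
```

Under this view, an S3 bucket and a Hub dataset repo play the same role: a named container holding `index.json` and the shard files under some prefix.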

siddk avatar Apr 17 '24 09:04 siddk

> Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
>
> From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?

Yes, that's correct!

lhoestq avatar Apr 17 '24 17:04 lhoestq

> Just following up on this; @karan6181 @lhoestq -- my understanding is that the HF Hub exposes dataset repositories via an fsspec API: https://huggingface.co/docs/huggingface_hub/main/en/guides/hf_file_system
>
> From the Mosaic Streaming perspective -- can I just upload MDS shards to a Hub repo, and use the corresponding hf:// path as a drop-in replacement for an s3:// path?
>
> Basically @karan6181 -- trying to figure out what "S3-compatible object store" really means under the hood vs. what the HF Hub natively supports.

@siddk It appears that the HF hub functions primarily as a cloud storage solution, accessible via the hf:// prefix. Integrating HF hub support into the streaming dataset should be straightforward. Do you have the capacity to implement HF hub backend support in the streaming dataset? You can model your work on the structure outlined in the PRs at https://github.com/mosaicml/streaming/pull/311 and https://github.com/mosaicml/streaming/pull/256. Please let us know if you have any questions—we're here to assist you.

karan6181 avatar Apr 23 '24 12:04 karan6181

Hey @karan6181 -- I'm a bit swamped with upcoming paper deadlines right now, but would love to see this supported. I can try carving out time to work on things in a few weeks, but wouldn't mind your expert take on this. I think the broader HF community would really appreciate it as well!

siddk avatar Apr 23 '24 13:04 siddk

Included in v0.8.0 release

mvpatel2000 avatar Jul 30 '24 17:07 mvpatel2000

Wow, amazing! Are there some docs already on how to use it?

Also, let me know if you plan to share this on social media. I'll be happy to re-share with the community!

lhoestq avatar Jul 30 '24 21:07 lhoestq

Hey @lhoestq, @orionw added support for storing MDS datasets on the Hugging Face Hub. The relevant section in the docs is here. Will ask internally about posting on socials!

@orionw provided this simple script which shows off the new functionality:

from streaming import StreamingDataset

# Create a streaming dataset backed by MDS shards on the Hugging Face Hub
dataset = StreamingDataset(
    remote="hf://datasets/orionweller/wikipedia_mds/",
    shuffle=False,
    split=None,
    batch_size=1,
)

# Print the first sample to see what's in it
for sample in dataset:
    text = sample['text']
    sample_id = sample['id']  # renamed to avoid shadowing the built-in id()
    print(f"Text: {text}")
    print(f"ID: {sample_id}")
    break
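
As a small follow-on to the script above, the same "look at the first sample" step can be wrapped in reusable helpers. This is a sketch that assumes only that the dataset is iterable and yields dict-like samples, as in the script; `peek` and `describe` are hypothetical names, not part of Streaming:

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(samples: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n samples without iterating the whole dataset."""
    return list(islice(samples, n))


def describe(sample: Dict[str, Any], max_chars: int = 80) -> str:
    """One-line summary of a sample: each field name with a truncated value."""
    return ", ".join(f"{k}={str(v)[:max_chars]!r}" for k, v in sample.items())
```

With the `dataset` from the script, `for s in peek(dataset, 2): print(describe(s))` would print two truncated samples and stop, which avoids pulling more shards than needed.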

snarayan21 avatar Jul 30 '24 21:07 snarayan21

@lhoestq we tweeted here: https://x.com/DbrxMosaicAI/status/1818407826852921833 Thanks!

snarayan21 avatar Jul 31 '24 03:07 snarayan21