
Incremental dataset (e.g. `.push_to_hub(..., append=True)`)

Wauplin opened this issue 2 years ago • 4 comments

Feature request

Add the ability to call `ds.push_to_hub(..., append=True)` to append new data to an existing dataset on the Hub.

Motivation

Requested in this comment and this comment. Discussed internally on Slack.

Your contribution

What I suggest for Parquet datasets is to use `CommitOperationCopy` + `CommitOperationDelete` from `huggingface_hub`:

  1. list the existing Parquet files
  2. copy each file, e.g. from parquet-0001-of-0004 to parquet-0001-of-0005
  3. delete the old files like parquet-0001-of-0004
  4. generate + add the last parquet file parquet-0005-of-0005

=> make a single commit with all commit operations at once
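The steps above can be sketched with a pure helper that computes the rename plan. The shard-naming scheme and repo id below are illustrative assumptions (not necessarily what `datasets` emits), and the actual `huggingface_hub` calls are shown commented out since they need a real repo and credentials:

```python
def plan_append(existing_shards):
    """Given the current Parquet shard paths (N shards), return the
    (old_path, new_path) renames that renumber them to N + 1 shards,
    plus the path for the new final shard.

    The data/train-00000-of-00004.parquet naming is an illustrative
    convention used for this sketch.
    """
    n = len(existing_shards)
    renames = [
        (path, path.replace(f"-of-{n:05d}", f"-of-{n + 1:05d}"))
        for path in sorted(existing_shards)
    ]
    new_shard = f"data/train-{n:05d}-of-{n + 1:05d}.parquet"
    return renames, new_shard


# Turning the plan into a single commit (hypothetical repo id and file):
#
# from huggingface_hub import (
#     CommitOperationAdd, CommitOperationCopy, CommitOperationDelete, HfApi,
# )
# api = HfApi()
# shards = [f for f in api.list_repo_files("user/repo", repo_type="dataset")
#           if f.endswith(".parquet")]
# renames, new_shard = plan_append(shards)
# ops = [CommitOperationCopy(src_path_in_repo=old, path_in_repo=new)
#        for old, new in renames]
# ops += [CommitOperationDelete(path_in_repo=old) for old, _ in renames]
# ops.append(CommitOperationAdd(path_in_repo=new_shard,
#                               path_or_fileobj="local_new_shard.parquet"))
# api.create_commit("user/repo", repo_type="dataset", operations=ops,
#                   commit_message="Append one shard")
```

Separating the pure rename logic from the commit call keeps the tricky part (the renumbering) easy to unit-test.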

I think it should be quite straightforward to implement. Happy to review a PR (maybe conflicting with the ongoing "1 commit push_to_hub" PR https://github.com/huggingface/datasets/pull/6269)

Wauplin avatar Oct 10 '23 15:10 Wauplin

Yeah, I think waiting for #6269 would be best, or branching from it. For reference, this PR, which does something similar using the HF Hub for our LAION dataset bot, is progressing well: https://github.com/LAION-AI/Discord-Scrapers/pull/2.

ZachNagengast avatar Oct 13 '23 16:10 ZachNagengast

Is there any update on this?

nqyy avatar Jul 18 '24 23:07 nqyy

Is there any update on this?

Elfsong avatar Aug 31 '24 17:08 Elfsong

No update so far on this feature request, but for broader context, this announcement will help with incremental datasets: https://huggingface.co/blog/xethub-joins-hf :)

Wauplin avatar Sep 02 '24 08:09 Wauplin

Still no update? What's the current recommended way to upload large datasets to the Hub? I can't load it all into memory past a certain size, right?

tolgadur avatar Mar 12 '25 09:03 tolgadur

> Still no update? What's the current recommended way to upload large datasets to the Hub? I can't load it all into memory past a certain size, right?

@tolgadur bro, you can set `data_dir` as a temporary workaround, like this:

Dataset.push_to_hub('your_repo_name', data_dir='data_{index}')
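As a sketch of that workaround: each incremental push writes its shards under a distinct `data_dir` ("data_0", "data_1", ...), leaving earlier pushes untouched. The repo name is a placeholder, and the push callable is injected so the loop stays testable without Hub credentials:

```python
def push_incrementally(batches, push):
    """Push each batch of rows under its own data_dir.

    `push` is injected for testability; in real use it would be e.g.
        push = lambda batch, data_dir: Dataset.from_dict(batch).push_to_hub(
            "your_repo_name", data_dir=data_dir)  # hypothetical repo name
    """
    dirs = []
    for index, batch in enumerate(batches):
        data_dir = f"data_{index}"  # one directory per incremental push
        push(batch, data_dir)
        dirs.append(data_dir)
    return dirs
```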

Elfsong avatar Mar 12 '25 09:03 Elfsong

Related to this feature request: PyArrow 21 is out and added content-defined chunking for Parquet, which enables deduplicated uploads to Xet.

Therefore, `append=True` could modify an existing Parquet file to add more rows without having to re-upload the full file.

lhoestq avatar Aug 13 '25 12:08 lhoestq