Incremental dataset (e.g. `.push_to_hub(..., append=True)`)
Feature request
Add the possibility to do `ds.push_to_hub(..., append=True)`.
Motivation
Requested in this comment and this comment. Discussed internally on slack.
Your contribution
What I suggest for parquet datasets is to use `CommitOperationCopy` + `CommitOperationDelete` from `huggingface_hub`:
- list the existing parquet files
- copy files, e.g. `parquet-0001-of-0004` to `parquet-0001-of-0005`
- delete the old files like `parquet-0001-of-0004`
- generate + add the last parquet file `parquet-0005-of-0005`
=> make a single commit with all the commit operations at once
I think it should be quite straightforward to implement. Happy to review a PR (maybe conflicting with the ongoing "1 commit push_to_hub" PR https://github.com/huggingface/datasets/pull/6269)
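The rename/delete/add bookkeeping above could be sketched as follows. This is only an illustrative sketch: the shard naming (`data-00000-of-00004.parquet`) and the `plan_append` helper are assumptions for the example, not the actual `push_to_hub` naming scheme, and the `huggingface_hub` commit operations are referenced in comments only.

```python
def plan_append(n_old: int) -> dict:
    """Compute the operations needed to go from n_old shards to
    n_old + 1 shards in a single commit (naming is hypothetical)."""
    n_new = n_old + 1
    old = [f"data-{i:05d}-of-{n_old:05d}.parquet" for i in range(n_old)]
    new = [f"data-{i:05d}-of-{n_new:05d}.parquet" for i in range(n_old)]
    return {
        # one CommitOperationCopy(src_path_in_repo=src, path_in_repo=dst) per pair
        "copy": list(zip(old, new)),
        # one CommitOperationDelete(path_in_repo=path) per old shard
        "delete": old,
        # one CommitOperationAdd(path_in_repo=..., path_or_fileobj=...) for the new shard
        "add": f"data-{n_old:05d}-of-{n_new:05d}.parquet",
    }

plan = plan_append(4)
# plan["copy"][0] == ("data-00000-of-00004.parquet", "data-00000-of-00005.parquet")
# plan["add"]     == "data-00004-of-00005.parquet"
```

All resulting operations would then be sent in one `HfApi().create_commit(...)` call, so the repo never shows a half-renamed state.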
Yeah, I think waiting for #6269 would be best, or branching from it. For reference, this PR, which will do something similar using the HF Hub for our LAION dataset bot, is progressing pretty well: https://github.com/LAION-AI/Discord-Scrapers/pull/2.
Is there any update on this?
No update so far on this feature request, but for broader context, this announcement will help with incremental datasets: https://huggingface.co/blog/xethub-joins-hf :)
Still no update? What's the current recommended way to upload large datasets to the Hub? I can't load it all into memory past a certain size, right?
@tolgadur you may set `data_dir` as a temporary workaround, like this:
`Dataset.push_to_hub('your_repo_name', data_dir='data_{index}')`
Related to this feature request: pyarrow 21 is out and adds content-defined chunking (CDC) for Parquet, which enables deduplicated uploads to Xet.
Therefore `append=True` could have logic that modifies an existing Parquet file to add more rows without having to re-upload the full file.