Hf ingestion

Open XiaohanZhangCMU opened this issue 2 years ago • 0 comments

Description of changes:

Add an ingestion helper utility for downloading Hugging Face datasets. It builds on snapshot_download, with the following improvements:

  • Enable resume = True and retry when the network fails
  • Disable progress_bar to prevent browser/terminal crashes
  • Add a monitor that prints file stats every 15 seconds
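
To illustrate the retry and monitor behavior listed above, here is a minimal sketch, not the code in this PR. The names dir_stats, monitor, and download_with_retry are hypothetical, as are the max_retries and backoff parameters; the download itself is passed in as a callable so the sketch does not assume any particular snapshot_download signature.

```python
import os
import threading
import time


def dir_stats(path):
    """Return (file_count, total_bytes) for all files under path."""
    count, size = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                size += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                # A file may be renamed or removed mid-download; skip it.
                pass
    return count, size


def monitor(path, stop_event, interval=15.0):
    """Print file stats every `interval` seconds until stop_event is set."""
    # Event.wait returns False on timeout and True once the event is set.
    while not stop_event.wait(interval):
        count, size = dir_stats(path)
        print(f"[monitor] {count} files, {size} bytes under {path}")


def download_with_retry(download_fn, local_dir, max_retries=5,
                        backoff=1.0, interval=15.0):
    """Call download_fn() until it succeeds, with a background monitor.

    download_fn is any zero-argument callable (hypothetical hook); it is
    expected to resume partial downloads on its own when called again.
    """
    stop_event = threading.Event()
    watcher = threading.Thread(
        target=monitor, args=(local_dir, stop_event, interval), daemon=True
    )
    watcher.start()
    try:
        for attempt in range(1, max_retries + 1):
            try:
                return download_fn()
            except Exception as exc:  # broad on purpose in this sketch
                if attempt == max_retries:
                    raise
                print(f"attempt {attempt} failed ({exc}); retrying")
                time.sleep(backoff * attempt)  # linear backoff
    finally:
        # Always stop the monitor, whether the download succeeded or not.
        stop_event.set()
        watcher.join()
```

In practice, download_fn could be a functools.partial around huggingface_hub's snapshot_download (for example with repo_type="dataset" and a local_dir), with the resume option from the first bullet set there. Progress bars can likely be turned off beforehand with huggingface_hub.utils.disable_progress_bars(); treat both as assumptions to check against the installed huggingface_hub version.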

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • [ ] I have read the contributor guidelines
  • [ ] This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • [ ] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • [ ] I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • [ ] I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • [ ] I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • [ ] I ran the tests locally to make sure they pass. (check out testing)
  • [ ] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

XiaohanZhangCMU, Oct 23 '23 21:10