[Data][Experimental] Progress Tracking for Webdataset
Why are these changes needed?
When processing large datasets, Ray Data doesn't support any form of resumption. This PR adds an experimental progress tracker for reading and writing WebDataset datasets. It uses a ProgressTracker actor to keep track of shards and keys that have already been processed, so that rows completed in a previous run can be skipped.
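The bookkeeping the tracker needs is small: a record of finished shards and finished sample keys, with a query interface for the reader and a report interface for the writer. A minimal sketch of that logic is below; in the PR this state would live in a Ray actor so all tasks share it, but here it is a plain class, and the method names are illustrative assumptions rather than the PR's actual API.

```python
class ProgressTracker:
    """Illustrative (non-actor) sketch of resumption bookkeeping.

    Tracks which tar shards and which sample keys have already been
    written, so a re-run can skip work that previously completed.
    """

    def __init__(self):
        self.completed_shards = set()  # paths of fully written tar shards
        self.completed_keys = set()    # "__key__" values of written samples

    def report_written(self, shard_path, keys):
        # Called by the sink after a shard is successfully written.
        self.completed_shards.add(shard_path)
        self.completed_keys.update(keys)

    def should_skip_shard(self, shard_path):
        # Called by the source before reading a tarfile.
        return shard_path in self.completed_shards

    def should_skip_key(self, key):
        # Called by the source per sample within a partially done shard.
        return key in self.completed_keys
```

In the real implementation the same object would be a `@ray.remote` actor so the skip/report calls are serialized across concurrent read and write tasks.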
The WebdatasetSource now skips keys and tarfiles that have already been processed, and the WebdatasetSink reports successfully written keys and paths to the ProgressTracker.
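On the read side, the skip amounts to filtering the sample stream by each sample's WebDataset `__key__` against the set of already-processed keys. A hedged, self-contained sketch (the function name and the plain-set argument are assumptions for illustration, not the PR's actual signature):

```python
def skip_processed(samples, processed_keys):
    """Yield only samples whose WebDataset "__key__" was not yet processed.

    `samples` is an iterable of dicts as produced by a WebDataset reader,
    where "__key__" identifies the sample within its tar shard.
    """
    for sample in samples:
        if sample["__key__"] not in processed_keys:
            yield sample


# Example: with "img_000" already done, only "img_001" survives.
resumed = list(
    skip_processed(
        [{"__key__": "img_000"}, {"__key__": "img_001"}],
        processed_keys={"img_000"},
    )
)
```

Filtering by key (rather than only by shard) is what lets a run resume mid-shard when a previous run died partway through writing a tarfile.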
This doesn't change how Ray writes WebDatasets. Each run gets a new dataset UUID, but since the webdataset library doesn't care about shard names so long as they are sequential, this works.
Checks
- [x] I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [x] This PR is not tested :(