[Data][Experimental] Progress Tracking for Webdataset
Why are these changes needed?
When processing large datasets, Ray Data doesn't support any form of resumption. This PR adds an experimental progress tracker for reading and writing WebDataset datasets. It uses a ProgressTracker actor to keep track of shards and keys that have already been processed, so that rows completed in a previous run can be skipped.
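The bookkeeping the tracker needs is small: a record of finished shards and finished sample keys, with a query interface for the reader and a report interface for the writer. A minimal sketch of that logic is below; in the PR this state would live in a Ray actor so all tasks share it, but here it is a plain class, and the method names are illustrative assumptions rather than the PR's actual API.

```python
class ProgressTracker:
    """Illustrative (non-actor) sketch of resumption bookkeeping.

    Tracks which tar shards and which sample keys have already been
    written, so a re-run can skip work that previously completed.
    """

    def __init__(self):
        self.completed_shards = set()  # paths of fully written tar shards
        self.completed_keys = set()    # "__key__" values of written samples

    def report_written(self, shard_path, keys):
        # Called by the sink after a shard is successfully written.
        self.completed_shards.add(shard_path)
        self.completed_keys.update(keys)

    def should_skip_shard(self, shard_path):
        # Called by the source before reading a tarfile.
        return shard_path in self.completed_shards

    def should_skip_key(self, key):
        # Called by the source per sample within a partially done shard.
        return key in self.completed_keys
```

In the real implementation the same object would be a `@ray.remote` actor so the skip/report calls are serialized across concurrent read and write tasks.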
The WebdatasetSource now skips keys and tarfiles that have already been processed, and the WebdatasetSink reports successfully written keys and paths to the ProgressTracker.
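On the read side, the skip amounts to filtering the sample stream by each sample's WebDataset `__key__` against the set of already-processed keys. A hedged, self-contained sketch (the function name and the plain-set argument are assumptions for illustration, not the PR's actual signature):

```python
def skip_processed(samples, processed_keys):
    """Yield only samples whose WebDataset "__key__" was not yet processed.

    `samples` is an iterable of dicts as produced by a WebDataset reader,
    where "__key__" identifies the sample within its tar shard.
    """
    for sample in samples:
        if sample["__key__"] not in processed_keys:
            yield sample


# Example: with "img_000" already done, only "img_001" survives.
resumed = list(
    skip_processed(
        [{"__key__": "img_000"}, {"__key__": "img_001"}],
        processed_keys={"img_000"},
    )
)
```

Filtering by key (rather than only by shard) is what lets a run resume mid-shard when a previous run died partway through writing a tarfile.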
This doesn't change how Ray writes WebDatasets. Each run gets a new dataset UUID, but since the webdataset library doesn't care about shard names so long as they are sequential, this works.
Checks
- [x] I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [x] This PR is not tested :(