Allow tensorflow dataset to fetch chunks
As described in #1885, iterating over a small dataset with `ds.tensorflow()` is very slow.
When the underlying hub dataset consists of many small samples, iteration becomes considerably faster if `fetch_chunks=True` is used when getting the data. This commit adds the ability to pass that option when creating the TensorFlow dataset from the hub dataset.
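To illustrate why chunked fetching helps, here is a toy model (pure Python, hypothetical names, not the actual Deep Lake implementation): each remote read carries a fixed latency, so reading one chunk of many samples requires far fewer reads than reading every sample individually.

```python
# Toy model (hypothetical, not the Deep Lake implementation) of why
# fetching whole chunks beats fetching one sample at a time: each
# remote read has fixed latency, so fewer reads means higher throughput.

CHUNK_SIZE = 100    # samples stored per chunk (assumed for illustration)
NUM_SAMPLES = 1000

def reads_per_iteration(fetch_chunks: bool) -> int:
    """Count simulated remote reads needed to iterate the whole dataset."""
    if fetch_chunks:
        # one read per chunk, amortized over all samples it contains
        return -(-NUM_SAMPLES // CHUNK_SIZE)  # ceiling division -> 10 reads
    # one read per sample
    return NUM_SAMPLES                        # 1000 reads

print(reads_per_iteration(True), reads_per_iteration(False))  # 10 1000
```

With these assumed numbers the chunked path issues 100x fewer reads, which is consistent with the order-of-magnitude speedups reported below.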
On an MNIST dataset the result is 25 times faster iteration.
Benchmark can be found here: https://gist.github.com/daniel-falk/188b96013e9f0cedcf555a0a30fa177d
- TF with chunks: 957 frames/second
- TF current: 26 frames/second
- Hub with chunks: 1294 frames/second
- Hub per sample: 28 frames/second
🚀 🚀 Pull Request
Checklist:
- [x] My code follows the style guidelines of this project and the Contributing document
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have kept the coverage rate up
- [x] I have performed a self-review of my own code and resolved any problems
- [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
- [ ] I have described and made corresponding changes to the relevant documentation
- [ ] New and existing unit tests pass locally with my changes
I can also add to the benchmarking that by batching the samples I reach essentially the same throughput with tensorflow (1245 frames / second) as with the hub dataset directly (1294 frames / second):
```python
batch_size = 128
it = iter(hub_ds.tensorflow(fetch_chunks=True).batch(batch_size))
```
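For readers unfamiliar with what `.batch()` buys here, this is a minimal sketch in pure Python (not `tf.data`, just an illustration) of the grouping it performs: per-item overhead is amortized across each batch.

```python
# Minimal sketch (pure Python, not tf.data) of what batching does:
# group consecutive samples into fixed-size lists, with a possibly
# smaller final batch, so per-item overhead is paid once per batch.

def batch(iterable, batch_size):
    """Yield lists of up to batch_size consecutive items."""
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf  # final partial batch

batches = list(batch(range(10), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```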
@daniel-falk thank you so much for the contribution, this is 🔥 💣 . Looping in @tatevikh to assign reviewers.
I created a simple example of how I typically use the datasets, to measure the timing with and without chunked fetching. However, training on the dataset with fetch_chunks=True resulted in an exception that needs to be investigated:
hub.util.exceptions.ReadOnlyModeError: Modification when in read-only mode is not supported!
Full example is here: https://gist.github.com/daniel-falk/ec6b0dab5514b6ad54e08b666bc841b7
Hey @daniel-falk ! Thanks for reporting the issue. #1911 should fix it and we will include it in the next release.
Thanks @AbhinavTuli, with that fix it works well. With the large images in my "predict weather" example above, training the model was ~30% faster when fetching chunks.
I have changed the default value of fetch_chunks to True and updated the docstring to refer to deeplake instead of hub.
Codecov Report
Base: 89.86% // Head: 89.86% // No change to project coverage :thumbsup:
Coverage data is based on head (9561736) compared to base (a57a747). Patch coverage: 66.66% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## main #1887 +/- ##
=======================================
Coverage 89.86% 89.86%
=======================================
Files 248 248
Lines 26493 26493
=======================================
Hits 23809 23809
Misses 2684 2684
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 89.86% <66.66%> (ø) | |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| deeplake/integrations/tf/datasettotensorflow.py | 61.36% <50.00%> (ø) | |
| deeplake/core/dataset/dataset.py | 92.90% <100.00%> (ø) | |
:umbrella: View full report at Codecov.
@daniel-falk, this is the first OSS contribution to Deep Lake! Yay! :) I'll hit you up in the community channel for some swag goodies. :)