Allow tensorflow dataset to fetch chunks
As described in #1885, iterating over a small dataset with `ds.tensorflow()` is very slow.
When the underlying hub dataset consists of many small samples, iteration becomes considerably faster if `fetch_chunks=True` is used when getting the data. This commit adds the ability to pass that option when creating the TensorFlow dataset from the hub dataset.
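To illustrate why chunked fetching helps, here is a toy model (pure Python, hypothetical names, not the actual Deep Lake implementation): each remote read carries a fixed latency, so reading one chunk of many samples requires far fewer reads than reading every sample individually.

```python
# Toy model (hypothetical, not the Deep Lake implementation) of why
# fetching whole chunks beats fetching one sample at a time: each
# remote read has fixed latency, so fewer reads means higher throughput.

CHUNK_SIZE = 100    # samples stored per chunk (assumed for illustration)
NUM_SAMPLES = 1000

def reads_per_iteration(fetch_chunks: bool) -> int:
    """Count simulated remote reads needed to iterate the whole dataset."""
    if fetch_chunks:
        # one read per chunk, amortized over all samples it contains
        return -(-NUM_SAMPLES // CHUNK_SIZE)  # ceiling division -> 10 reads
    # one read per sample
    return NUM_SAMPLES                        # 1000 reads

print(reads_per_iteration(True), reads_per_iteration(False))  # 10 1000
```

With these assumed numbers the chunked path issues 100x fewer reads, which is consistent with the order-of-magnitude speedups reported below.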
On an MNIST dataset the result is 25 times faster iteration.
Benchmark can be found here: https://gist.github.com/daniel-falk/188b96013e9f0cedcf555a0a30fa177d
- TF with chunks: 957 frames/second
- TF current: 26 frames/second
- Hub with chunks: 1294 frames/second
- Hub per sample: 28 frames/second
🚀 🚀 Pull Request
Checklist:
- [x] My code follows the style guidelines of this project and the Contributing document
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have kept the coverage rate up
- [x] I have performed a self-review of my own code and resolved any problems
- [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
- [ ] I have described and made corresponding changes to the relevant documentation
- [ ] New and existing unit tests pass locally with my changes
I can also add to the benchmarking that by batching the samples I reach essentially the same throughput with tensorflow (1245 frames / second) as with the hub dataset directly (1294 frames / second):
```python
batch_size = 128
it = iter(hub_ds.tensorflow(fetch_chunks=True).batch(batch_size))
```
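For readers unfamiliar with what `.batch()` buys here, this is a minimal sketch in pure Python (not `tf.data`, just an illustration) of the grouping it performs: per-item overhead is amortized across each batch.

```python
# Minimal sketch (pure Python, not tf.data) of what batching does:
# group consecutive samples into fixed-size lists, with a possibly
# smaller final batch, so per-item overhead is paid once per batch.

def batch(iterable, batch_size):
    """Yield lists of up to batch_size consecutive items."""
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf  # final partial batch

batches = list(batch(range(10), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```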
@daniel-falk thank you so much for the contribution, this is 🔥 💣 . Looping in @tatevikh to assign reviewers.
I created a simple example of how I typically use the datasets, to measure the timing with and without chunked fetching. However, training on the dataset with fetch_chunks=True resulted in an exception that needs to be investigated:
hub.util.exceptions.ReadOnlyModeError: Modification when in read-only mode is not supported!
Full example is here: https://gist.github.com/daniel-falk/ec6b0dab5514b6ad54e08b666bc841b7
Hey @daniel-falk ! Thanks for reporting the issue. #1911 should fix it and we will include it in the next release.
Thanks @AbhinavTuli, with that fix it works well. With the large images in my "predict weather" example above, training the model was ~30% faster when fetching chunks.
I have changed the default value of fetch_chunks to True and updated the docstring to refer to deeplake instead of hub.
Codecov Report
Base: 89.86% // Head: 89.86% // No change to project coverage :thumbsup:
Coverage data is based on head (9561736) compared to base (a57a747). Patch coverage: 66.66% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## main #1887 +/- ##
=======================================
Coverage 89.86% 89.86%
=======================================
Files 248 248
Lines 26493 26493
=======================================
Hits 23809 23809
Misses 2684 2684
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 89.86% <66.66%> (ø) | |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| deeplake/integrations/tf/datasettotensorflow.py | 61.36% <50.00%> (ø) | |
| deeplake/core/dataset/dataset.py | 92.90% <100.00%> (ø) | |
:umbrella: View full report at Codecov.
@daniel-falk, this is the first OSS contribution to Deep Lake! Yay! :) I'll hit you up in the community channel for some swag goodies. :)