
Minimizing memory usage with a large custom dataset (possible memory leak with first epoch)

Open Dragonfire3900 opened this issue 3 years ago • 2 comments

I've written a custom dataset with the tfds CLI (a GeneratorBasedBuilder without Beam). Overall, the dataset is ~60 GB and is sourced from manually downloaded HDF5 files containing mostly float32 data.
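For context, here is a minimal sketch of the kind of builder involved. The class name, file name, feature keys, and shapes are placeholders I made up for illustration; only the GeneratorBasedBuilder structure itself matches my setup.

```python
# Minimal sketch of a GeneratorBasedBuilder streaming examples out of HDF5
# files. Names, keys, and shapes below are illustrative assumptions.
import h5py
import numpy as np
import tensorflow_datasets as tfds


class MyHdf5Dataset(tfds.core.GeneratorBasedBuilder):
    VERSION = tfds.core.Version("1.0.0")

    def _info(self) -> tfds.core.DatasetInfo:
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                # float32 tensors, matching the "mostly float32s" description.
                "signal": tfds.features.Tensor(shape=(1024,), dtype=np.float32),
                "label": tfds.features.ClassLabel(num_classes=10),
            }),
        )

    def _split_generators(self, dl_manager):
        # Manually downloaded files live under dl_manager.manual_dir.
        path = dl_manager.manual_dir / "data.h5"
        return {"train": self._generate_examples(path)}

    def _generate_examples(self, path):
        # Yield one example at a time so the whole file never sits in memory.
        with h5py.File(path, "r") as f:
            for i in range(f["signal"].shape[0]):
                yield i, {
                    "signal": f["signal"][i].astype(np.float32),
                    "label": int(f["label"][i]),
                }
```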

I'm encountering an issue where iterating through the dataset consumes a huge amount of memory, much more than I think it should. As it iterates, TensorFlow seems to be caching the data or losing track of memory. Specifically, after iterating over 20% of the dataset (12 GB), memory usage tops out at around 17 GB. After the first epoch, the growth slows down dramatically.
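Roughly how I'm measuring this (a sketch, not my exact script; "my_dataset" is a placeholder name and psutil is just one way to read resident memory):

```python
# Track resident memory (RSS) with psutil while iterating one epoch.
import os

import psutil
import tensorflow_datasets as tfds

process = psutil.Process(os.getpid())
ds = tfds.load("my_dataset", split="train")  # placeholder dataset name

for step, example in enumerate(ds):
    if step % 1000 == 0:
        rss_gb = process.memory_info().rss / 1e9
        print(f"step {step}: RSS = {rss_gb:.1f} GB")
```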

What I'm wondering is: in what ways does tfds apply a cache when building? In addition, are there any configurations (when building or loading) that I could try in order to limit the memory impact of my dataset?

On the tfds side of things, I have already tried setting the read configuration tfds.ReadConfig(try_autocache=False, skip_prefetch=True). However, this only seemed to affect the speed of iterating through the dataset, not the memory usage, as I had expected it would.

I've been trying to read through the documentation of both tfds.ReadConfig and tfds.load but haven't really seen anything other than these two options.
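For completeness, this is how I'm passing that configuration (the dataset name is a placeholder):

```python
import tensorflow_datasets as tfds

read_config = tfds.ReadConfig(
    try_autocache=False,  # disable the automatic in-memory cache for small datasets
    skip_prefetch=True,   # let the caller add its own ds.prefetch(...)
)
ds = tfds.load("my_dataset", split="train", read_config=read_config)
```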

In addition, I've profiled my heap using tcmalloc and found that the allocations are coming from reading in the data. Most of those allocations remain resident in memory without being actively used at any given time.
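(As a pure-Python alternative to tcmalloc heap profiling, something like the stdlib tracemalloc module can attribute allocations to call sites, though it cannot see allocations made inside TensorFlow's C++ code. A rough sketch, with a placeholder dataset name:)

```python
# Attribute Python-level allocations while iterating a slice of the dataset.
import tracemalloc

import tensorflow_datasets as tfds

tracemalloc.start(25)  # keep 25 frames of traceback per allocation
ds = tfds.load("my_dataset", split="train")  # placeholder dataset name

for _ in ds.take(1000):
    pass

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("traceback")[:5]:
    print(stat)
    for line in stat.traceback.format():
        print(line)
```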

Environment information

  • Operating System: WSL 2.0 with Ubuntu 20.04
  • Python version: 3.8.10
  • tfds-nightly version: 4.6.0.dev202207180044
  • tf-nightly version: 2.11.0.dev20220805

Dragonfire3900 avatar Aug 16 '22 16:08 Dragonfire3900

Any solution?

ahmadmustafaanis avatar Dec 29 '23 19:12 ahmadmustafaanis

Possibly related: I have also run into issues where a huge amount of memory is used when attempting to do tfds.load() on a custom video dataset.

https://github.com/sign-language-processing/datasets/issues/68

I spent a lot of time trying to debug this using a high-memory instance on Colab Pro. In particular, I broke the load operation apart into its constituent parts and found that the massive memory allocation was happening during download_and_prepare, in the split-generation portion, before the dataset even got loaded. I kept drilling down using various techniques, eventually using memray.
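Roughly, the decomposition looked like this (a sketch; the dataset name is a placeholder, and the memray output path is arbitrary):

```python
# Run the download_and_prepare step separately, under memray, to see where
# the split-generation allocations come from.
import memray
import tensorflow_datasets as tfds

builder = tfds.builder("my_video_dataset")  # placeholder dataset name

with memray.Tracker("download_and_prepare.bin"):
    builder.download_and_prepare()  # the large allocations happened in here

# Loading afterwards reuses the already-prepared files.
ds = builder.as_dataset(split="train")
```

Running "memray flamegraph download_and_prepare.bin" afterwards renders the flamegraph from the capture file.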

Here are the flamegraph results (memray flamegraph screenshots):

Huge allocations in the serialization process, apparently!

When I stepped through it with a debugger and added print statements, it turned out all of this was encoding a single example containing 3 videos. None of the videos is over 100 MB in size; I don't really understand why this takes 30 GiB of RAM.

cleong110 avatar Mar 29 '24 15:03 cleong110