
PersistentDataset, CacheDataset improvements

Open myron opened this issue 3 years ago • 5 comments

It would be nice to have several improvements to caching of data on disk and in memory.

For caching on disk with PersistentDataset,

  • the first epoch takes a long time because saving is synchronous. I suggest saving on a background queue and returning the data right away. There is a small probability that the same image gets saved twice, but on average it should be a speed-up (see the sketch after this list).
  • the documentation implies that cache_dir is optional (but it's not): "..If specified, this is the location for persistent storage .."
  • The cached data is much larger on disk than the originals (and takes a lot of space). This is because the original .nii.gz files are usually uint8/16 for images and uint8 for labels, and gzip-compressed inside the .nii.gz container. I wonder if there is a way to do something similar for the cache.
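A minimal sketch of the idea behind the first and third bullets, using only the Python standard library; save_to_cache, get_item, and the cache file layout are hypothetical illustrations and not part of MONAI's PersistentDataset:

```python
import gzip
import pickle
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Single background worker so cache writes never block the training loop.
_writer = ThreadPoolExecutor(max_workers=1)


def save_to_cache(item, cache_file: Path):
    """Hypothetical helper: gzip the pickled item to shrink the on-disk cache."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    tmp = cache_file.with_suffix(".tmp")
    with gzip.open(tmp, "wb") as f:
        pickle.dump(item, f, protocol=pickle.HIGHEST_PROTOCOL)
    tmp.rename(cache_file)  # rename last so readers never see a partial file


def get_item(index, data, transform, cache_dir: Path):
    """Hypothetical lookup: return immediately, write the cache in the background on a miss."""
    cache_file = cache_dir / f"{index}.pkl.gz"
    if cache_file.exists():
        with gzip.open(cache_file, "rb") as f:
            return pickle.load(f)
    item = transform(data[index])                     # compute as usual on a cache miss
    _writer.submit(save_to_cache, item, cache_file)   # save later, return right away
    return item
```

With this pattern a cache miss still returns immediately; at worst two workers rewrite the same file, which only costs a duplicate write.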

For caching in memory with CacheDataset

  • on a multi-GPU machine, each process gets its own cache, and the processes have no access to a shared cache. As of Python 3.8 we can create a shared memory object (https://docs.python.org/3/library/multiprocessing.shared_memory.html) so that all processes can use it (see the sketch after this list). It would be nice to have this; otherwise it's not very practical to use CacheDataset on a multi-GPU machine. I know there is a way to manually partition data between processes, but then we need to worry about accuracy due to less random data sampling.
  • In-memory caching runs before any training starts, which takes time. Can we do it on the fly (similar to PersistentDataset): as we iterate the first epoch, we add data into the memory cache?
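A minimal sketch of the Python 3.8 primitive referenced above, using only the standard library and numpy; the block name "monai_cache" and the array shape are made-up examples, and this is not how CacheDataset currently works:

```python
import numpy as np
from multiprocessing import shared_memory

# Producer: allocate a named shared block and copy a cached array into it.
cached = np.random.rand(4, 96, 96, 96).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=cached.nbytes, name="monai_cache")
shared_view = np.ndarray(cached.shape, dtype=cached.dtype, buffer=shm.buf)
shared_view[:] = cached  # copy once; every process attaching by name sees the same bytes

# Consumer (normally in another process): attach by name, no extra copy of the data.
existing = shared_memory.SharedMemory(name="monai_cache")
view = np.ndarray((4, 96, 96, 96), dtype=np.float32, buffer=existing.buf)
print(view.mean())

# Cleanup: every process closes its handle, and exactly one process unlinks the block.
existing.close()
shm.close()
shm.unlink()
```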

Thank you

myron avatar Feb 23 '22 01:02 myron

Hi @myron ,

Thanks very much for your detailed feedback.

  1. I can try to investigate the multi-thread saving logic later and see whether the total training time is really faster.

  2. For the cache_dir arg, I submitted a closed PR before: https://github.com/Project-MONAI/MONAI/pull/3453; @wyli shared some opinions in that PR.

  3. If you are OK with saving the data in uint8, you can put a CastToType transform before caching (see the sketch after this list): https://github.com/Project-MONAI/MONAI/blob/dev/monai/transforms/utility/array.py#L313

  4. Let me investigate this new shared memory object in Python 3.8. The reason we didn't use shared memory in CacheDataset is that the IPC overhead of shared memory makes training much slower.

  5. If computing the cache on the fly, we may need to use shared memory first, so it's the same question as above. Please note that the multi-GPU case is "multiprocessing inside multiprocessing", because the PyTorch DataLoader is based on Python multiprocessing and the distributed launch is based on Python subprocess.
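A minimal sketch of point 3 for a dictionary-based pipeline; the keys, dtypes, file names, and cache directory are placeholders, and exact transform arguments may differ between MONAI versions:

```python
import numpy as np
from monai.data import PersistentDataset
from monai.transforms import CastToTyped, Compose, EnsureChannelFirstd, LoadImaged

# Cast before caching so PersistentDataset pickles int16/uint8 arrays
# instead of float32, which shrinks the on-disk cache considerably.
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    CastToTyped(keys=["image", "label"], dtype=(np.int16, np.uint8)),
    # ...further transforms (e.g. random augmentations) would go here...
])

data = [{"image": "img0.nii.gz", "label": "seg0.nii.gz"}]  # placeholder file list
ds = PersistentDataset(data=data, transform=transforms, cache_dir="./persistent_cache")
```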

Thanks.

Nic-Ma avatar Feb 23 '22 03:02 Nic-Ma

"If you are OK to save the data in uint8, you can put a CastToType transform"

That's a good suggestion, it should be smaller in size then, but I guess it won't be compressed (gzipped). For images with many zeros (background), nii.gz compression reduces the size a lot.

Thanks

myron avatar Feb 23 '22 21:02 myron

ok, closing for now

myron avatar Apr 04 '22 19:04 myron

One thing to note about the zipping/compression of the files: when I time-profile training with datasets consisting of .nii.gz files, the majority of the time is actually spent in LoadImage, specifically in decompressing the files. Hence, I think not compressing is the way to go on that front.
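A quick way to reproduce that profiling on your own data (the file paths are placeholders, and LoadImage's return type may differ across MONAI versions):

```python
import time
from monai.transforms import LoadImage

loader = LoadImage(image_only=True)

# Placeholder paths: the same volume stored compressed and uncompressed.
for path in ["case0.nii.gz", "case0.nii"]:
    start = time.perf_counter()
    img = loader(path)
    print(f"{path}: loaded in {time.perf_counter() - start:.2f}s, shape {img.shape}")
```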

rijobro avatar Apr 22 '22 15:04 rijobro

Huge thumbs up for the caching-on-the-fly 👍 This would greatly simplify multi-GPU workflows, since the distributed batch sampler only uses a subset of the data on each GPU. Further, caching all samples for each GPU has already resulted in OOM problems for large datasets in my experiments.

razorx89 avatar Jun 03 '22 12:06 razorx89

Hi @myron -- sorry I'm only seeing this now, it's a great idea. On your third bullet point, however, I've noticed that decompressing images is by far the slowest transform when dealing with .nii.gz files. Hence, I wouldn't recommend compressing the images when saving them (or if you do want to try it, it needs to be time-profiled!).

rijobro avatar Oct 13 '22 08:10 rijobro

Closing this for now via https://github.com/Project-MONAI/MONAI/pull/5365; please create follow-up tickets if it's not 100% addressed.

wyli avatar Nov 09 '22 15:11 wyli