Parallel dataloader for unconditional diffusion example
**Is your feature request related to a problem? Please describe.**
Sample loading is sequential and done without prefetching in the train_unconditional.py script.
This results in an underutilization of the GPU if large images and/or "slow" IO are used.
**Describe the solution you'd like**
Adding a parameter that sets the `num_workers` argument of the PyTorch DataLoader would solve this by loading samples in parallel across multiple worker processes.
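A minimal sketch of the idea (not the actual train_unconditional.py code; the flag name `--dataloader_num_workers` and the dummy dataset are assumptions for illustration):

```python
# Sketch: expose a CLI flag and forward it to the DataLoader's num_workers,
# so sample loading/prefetching happens in parallel worker processes.
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataloader_num_workers",
    type=int,
    default=0,  # 0 keeps the current sequential, main-process loading
    help="Number of subprocesses the DataLoader uses to load samples in parallel.",
)
args = parser.parse_args([])  # use defaults for this sketch

# Dummy tensor dataset standing in for the real image dataset.
dataset = TensorDataset(torch.randn(64, 3, 8, 8))

dataloader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=args.dataloader_num_workers,  # >0 enables parallel loading
)

n_batches = sum(1 for _ in dataloader)
```

With slow storage or large images, raising `num_workers` lets CPU-side decoding and IO overlap with GPU compute instead of stalling it.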
**Describe alternatives you've considered**
- Leave it as it is and accept slower training.
- I am not that familiar with accelerate, but maybe the number of workers could be obtained from there instead of introducing a new parameter.
**Additional context**
- Tested locally (1x 2080 ti + 32-core Threadripper) using our own dataset (https://github.com/bit-bots/TORSO_21_dataset). This resulted in a 140% speedup as well as higher GPU utilization in nvtop.
cc @anton-l
Any updates here @anton-l ?
Thanks for the feedback @Flova! Added the parameter in https://github.com/huggingface/diffusers/pull/1027