Parallel dataloader for unconditional diffusion example
**Is your feature request related to a problem? Please describe.**
Sample loading is sequential and done without prefetching in the train_unconditional.py script.
This results in an underutilization of the GPU if large images and/or "slow" IO are used.
**Describe the solution you'd like**
Adding a parameter that sets the `num_workers` argument of the PyTorch DataLoader would solve this by loading samples in parallel across multiple worker processes.
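A minimal sketch of the idea (not the actual train_unconditional.py code; the flag name `--dataloader_num_workers` and the dummy dataset are assumptions for illustration):

```python
# Sketch: expose a CLI flag and forward it to the DataLoader's num_workers,
# so sample loading/prefetching happens in parallel worker processes.
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataloader_num_workers",
    type=int,
    default=0,  # 0 keeps the current sequential, main-process loading
    help="Number of subprocesses the DataLoader uses to load samples in parallel.",
)
args = parser.parse_args([])  # use defaults for this sketch

# Dummy tensor dataset standing in for the real image dataset.
dataset = TensorDataset(torch.randn(64, 3, 8, 8))

dataloader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=args.dataloader_num_workers,  # >0 enables parallel loading
)

n_batches = sum(1 for _ in dataloader)
```

With slow storage or large images, raising `num_workers` lets CPU-side decoding and IO overlap with GPU compute instead of stalling it.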
**Describe alternatives you've considered**
- Leave it as it is and accept slower training.
- I am not that familiar with accelerate, but maybe the number of workers could be obtained from there instead of introducing a new parameter.
**Additional context**
- Tested locally (1x 2080 ti + 32-core Threadripper) using our own dataset (https://github.com/bit-bots/TORSO_21_dataset). This resulted in a 140% speedup as well as higher GPU utilization in nvtop.
cc @anton-l
Any updates here @anton-l ?
Thanks for the feedback @Flova! Added the parameter in https://github.com/huggingface/diffusers/pull/1027