
Slower Multi-GPU training with 2x the number of GPUs and 4x the amount of VRAM

scf4 opened this issue on May 07, 2023 • 0 comments

I have two systems training on identical datasets:

System A has 4x NVIDIA RTX A5000 GPUs (24GB VRAM each), with a batch size of 12 per GPU.

System B has 7x NVIDIA RTX A6000 GPUs (48GB VRAM each), with a batch size of 18 per GPU.

I would expect System B to train much faster. However...

  • System A (96GB total VRAM, batch size 12) takes 11 seconds per epoch.

  • System B (336GB total VRAM, batch size 18) takes 13 seconds per epoch (rough per-step math below).
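Doing the rough math on steps per epoch (the dataset size below is just a placeholder; only the ratio between the two systems matters):

```python
# Rough per-step comparison, assuming the same dataset on both systems.
# DATASET_SIZE is a hypothetical placeholder; the ratio does not depend on it.
DATASET_SIZE = 10_000

systems = {
    "A (4x A5000)": {"gpus": 4, "per_gpu_batch": 12, "epoch_seconds": 11},
    "B (7x A6000)": {"gpus": 7, "per_gpu_batch": 18, "epoch_seconds": 13},
}

for name, s in systems.items():
    global_batch = s["gpus"] * s["per_gpu_batch"]
    steps_per_epoch = DATASET_SIZE / global_batch
    sec_per_step = s["epoch_seconds"] / steps_per_epoch
    print(f"{name}: global batch {global_batch}, "
          f"~{steps_per_epoch:.0f} steps/epoch, ~{sec_per_step * 1000:.1f} ms/step")
```

With the same dataset, System B runs about 2.6x fewer optimizer steps per epoch (global batch 126 vs 48), so each step seems to take roughly 3x as long as on System A, which looks like more than the larger per-GPU batch (18 vs 12) alone would explain.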

Is this just the overhead of multi-GPU training, or is there something I'm missing here?
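In case it helps narrow it down, here's a rough sketch (not the actual RVC training loop; model, batch, and optimizer are placeholders) of how I'd time the host-to-device transfer separately from the forward/backward step, to see whether the extra time on System B comes from the data pipeline or from inter-GPU communication:

```python
import time
import torch

def timed_step(model, batch, optimizer, device):
    # Generic sketch only -- "model", "batch", and "optimizer" are placeholders,
    # not the actual RVC training objects.
    torch.cuda.synchronize(device)
    t0 = time.time()
    batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
    torch.cuda.synchronize(device)
    t_data = time.time() - t0          # host-to-device transfer time

    t0 = time.time()
    loss = model(**batch)              # assumed to return a scalar loss
    loss.backward()                    # with DDP, gradient all-reduce overlaps here
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.synchronize(device)     # wait for GPU + NCCL work to finish
    t_step = time.time() - t0          # forward/backward/step incl. communication
    return t_data, t_step
```

If t_data dominates, the data pipeline is likely the bottleneck; if t_step grows as GPUs are added, that would point at communication/sync overhead.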

Thank you

scf4 · May 07 '23, 11:05