Retrieval-based-Voice-Conversion-WebUI
Slower Multi-GPU training with 2x the number of GPUs and 4x the amount of VRAM
I have two systems training on identical datasets:
System A has 4 x NVIDIA RTX A5000 (24GB VRAM per GPU), and a batch size of 12 per GPU.
System B has 7 x NVIDIA RTX A6000 (48GB VRAM per GPU), and a batch size of 18 per GPU.
I would expect System B to train much faster. However...
- System A (96GB total VRAM, batch size 12 per GPU) takes 11 seconds per epoch.
- System B (336GB total VRAM, batch size 18 per GPU) takes 13 seconds per epoch.
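For a rough sense of scale, here is a back-of-the-envelope sketch of the global batch size and steps per epoch for each system. The dataset size below is a made-up placeholder (my real dataset is a different size), so only the relative comparison matters:

```python
# Hypothetical dataset size, for illustration only; the real figure differs.
DATASET_SIZE = 1000

systems = {
    "System A": {"gpus": 4, "batch_per_gpu": 12, "sec_per_epoch": 11},
    "System B": {"gpus": 7, "batch_per_gpu": 18, "sec_per_epoch": 13},
}

for name, s in systems.items():
    global_batch = s["gpus"] * s["batch_per_gpu"]   # samples processed per optimizer step
    steps = -(-DATASET_SIZE // global_batch)         # steps per epoch (ceiling division)
    sec_per_step = s["sec_per_epoch"] / steps        # average wall-clock time per step
    print(f"{name}: global batch {global_batch}, "
          f"~{steps} steps/epoch, ~{sec_per_step:.2f} s/step")
```

Under that assumption, System B runs far fewer (but much larger) steps per epoch, yet each step takes noticeably longer, which is why the epoch times end up so close despite the extra hardware.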
I'm wondering whether this is just the overhead of multi-GPU training, or if there's something I'm missing here?
Thank you