
About multi-GPU training

Open YANDaoyu opened this issue 1 year ago • 2 comments

That's definitely an impressive work!

I'm trying to reproduce some results on the inpainting task and have a concern about the data_parallel mode. Referring to the code, the batch_size is 4 for a single GPU, and the total number of inpainting pairs is about 2.8M, so the total number of logged steps is 700k. When I train on 8 GPUs, the total steps are still logged as 700k, and I've checked the GPU memory usage: all GPUs are nearly fully utilized. So I'm wondering whether the training batch_size for 8 GPUs is actually 4*8, or whether there is some misalignment in the logging?

Thanks for your time.

YANDaoyu · Mar 01 '24 03:03

Same question. Looking forward to the answer. @canqin001 @shugerdou

gallenszl · Mar 12 '24 12:03

Thank you for this question. For multi-GPU training, the overall batch size is the per-GPU batch size multiplied by the number of GPUs. The 700k iteration count is independent of the batch size, so you need to manually adjust the number of iterations to match the overall computation cost.

canqin001 · Mar 13 '24 02:03
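
For anyone reproducing this, here is a minimal sketch (not taken from the UniControl codebase; the function names and step counts are illustrative) of the arithmetic described in the answer above: under data parallelism the effective batch size grows with the number of GPUs, so the step budget that corresponds to one pass over the ~2.8M inpainting pairs shrinks accordingly, even though the logged target (700k) stays fixed unless you change it.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Overall batch size under data-parallel training."""
    return per_gpu_batch * num_gpus


def adjusted_max_steps(single_gpu_steps: int, num_gpus: int) -> int:
    """Scale a single-GPU step budget down so the total number of samples
    seen (the overall computation cost) stays roughly the same."""
    return single_gpu_steps // num_gpus


if __name__ == "__main__":
    per_gpu_batch = 4            # batch size per GPU, as in the issue
    num_gpus = 8                 # GPUs used for training
    dataset_size = 2_800_000     # ~2.8M inpainting pairs
    single_gpu_steps = dataset_size // per_gpu_batch  # 700k steps on 1 GPU

    print(effective_batch_size(per_gpu_batch, num_gpus))   # 32
    print(adjusted_max_steps(single_gpu_steps, num_gpus))  # 87500 steps per epoch-equivalent
```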