About multi-gpus training
This is definitely impressive work!
I'm trying to reproduce some results on the inpainting task and have a concern about the data_parallel mode. According to the code, batch_size is 4 for a single GPU and the total number of inpainting pairs is about 2.8M, so the total logged step count is 700k. When I train on 8 GPUs, the total step count is still logged as 700k, and I've checked GPU memory usage -- all GPUs are nearly fully utilized. So I'm wondering: is the training batch_size for 8 GPUs actually 4*8, or is there some misalignment in the logging?
Thanks for your time.
Same question. Looking forward to the answer. @canqin001 @shugerdou
Thank you for this question. For multi-GPU training, the overall batch size is num_per_batch * num_gpus (the per-GPU batch size times the number of GPUs). The 700k iteration count is independent of the batch size, so you need to manually adjust the number of iterations to match the overall computation cost.
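If the logged iteration count really is independent of the number of GPUs, one way to match the single-GPU sample budget is to scale the iteration count down by the data-parallel factor. A minimal sketch (the function name and numbers are illustrative, not from the repo; 700k iterations at batch 4 corresponds to the ~2.8M pairs mentioned above):

```python
def scaled_iterations(base_iters, base_batch, per_gpu_batch, num_gpus):
    """Iteration count that preserves the original total-sample budget
    when switching to data-parallel training on num_gpus GPUs."""
    total_samples = base_iters * base_batch      # e.g. 700k * 4 = 2.8M samples
    effective_batch = per_gpu_batch * num_gpus   # global batch under data parallel
    return total_samples // effective_batch

# Single-GPU schedule: 700k iterations at batch 4.
# On 8 GPUs with batch 4 each (global batch 32):
print(scaled_iterations(700_000, 4, 4, 8))  # -> 87500 iterations
```

So with 8 GPUs the logged 700k steps would process 8x more data than the single-GPU run unless the schedule is shortened accordingly.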