AdaptSegNet

Multi-GPU training

Open kshitijagrwl opened this issue 7 years ago • 11 comments

Hi, training on GTA2Cityscapes currently takes 2 days for 100k epochs, which is very slow. How can I run this on multiple GPUs?

kshitijagrwl avatar Sep 06 '18 06:09 kshitijagrwl

If you do mean 100k epochs, that is not slow, dude. Try nn.DataParallel(model) to run on multiple GPUs; you can find a tutorial here: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

wppply avatar Sep 07 '18 07:09 wppply
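
For readers following the nn.DataParallel suggestion above, here is a minimal sketch of the wrapping step. It is not taken from the AdaptSegNet code: `TinySegNet` is a hypothetical stand-in model, and the batch size and image size are arbitrary assumptions. The wrapper splits each input batch across the visible GPUs and gathers the outputs on one device. On a machine with one GPU or none, the model runs unwrapped.

```python
import torch
import torch.nn as nn


class TinySegNet(nn.Module):
    """Hypothetical stand-in for a segmentation network (not AdaptSegNet's model)."""

    def __init__(self, num_classes=19):
        super().__init__()
        self.conv = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        # Per-pixel class scores with the same spatial size as the input.
        return self.conv(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinySegNet()

# Wrap only when more than one GPU is visible; otherwise keep the plain model.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

# Assumed batch of 4 RGB images; DataParallel splits along dimension 0.
images = torch.randn(4, 3, 32, 64, device=device)
logits = model(images)
print(logits.shape)
```

Any other network used in the training loop would need the same wrapping, and the batch size has to be large enough to split across the GPUs. The PyTorch documentation recommends DistributedDataParallel over DataParallel for multi-GPU training, so this is a starting point rather than the fastest option.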

Oops, I meant 100k iterations! Thanks for the link, I will try it out and update.

kshitijagrwl avatar Sep 07 '18 12:09 kshitijagrwl

I have an image-computing PC with a CPU (16 GB) and a GPU (8 GB). Is that enough to train the model without hitting a CUDA out-of-memory error?

SiyuanWei avatar Oct 24 '18 07:10 SiyuanWei

I forget the details, but in my experience training on the Cityscapes dataset usually requires an 11 GB GPU.

wppply avatar Oct 24 '18 07:10 wppply
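
One way to check how a given GPU compares with the memory figure mentioned above is to read the device's total memory and the peak memory PyTorch has allocated so far. The sketch below is an illustration only: `gpu_memory_report` is a hypothetical helper, not part of AdaptSegNet, and it returns None on a machine without CUDA.

```python
import torch


def gpu_memory_report(device_index=0):
    """Return (total_gib, peak_allocated_gib) for one GPU, or None without CUDA.

    Hypothetical helper for illustration; call it after a few training
    iterations so the peak reflects a realistic batch.
    """
    if not torch.cuda.is_available():
        return None
    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(device_index).total_memory
    peak = torch.cuda.max_memory_allocated(device_index)
    return total / gib, peak / gib


report = gpu_memory_report()
if report is None:
    print("No CUDA device available")
else:
    print(f"total: {report[0]:.1f} GiB, peak allocated: {report[1]:.1f} GiB")
```

The peak counts only tensors allocated through PyTorch's allocator, so the usage shown by tools such as nvidia-smi is typically somewhat higher.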

Thanks anyway, although that is bad news.

SiyuanWei avatar Oct 24 '18 10:10 SiyuanWei

@kshitijagrwl Hi, have you completed the multi-GPU version?

ypjian avatar Dec 12 '18 02:12 ypjian

Has anyone tried a multi-GPU version? I want to train on multiple GPUs. Please share how to do it.

lerndeep avatar Mar 15 '20 11:03 lerndeep

@lerndeep I'm trying it now, running into a few bugs (fairly new to PyTorch). Will update here if/when I get it working

lychrel avatar Mar 21 '20 20:03 lychrel

@kshitijagrwl @lychrel Have you finished the multi-GPU version? Looking forward to your reply!

Lufei-github avatar Dec 30 '20 06:12 Lufei-github

@Lufei-github Tried it a couple times and couldn't avoid a memory leak that reboots my computer. I don't have this problem elsewhere, even in similar contexts (DeepLab)—but this is also a super simple training loop, so the culprit shouldn't be hard to find.

Ended up using different DA methods for the project I was working on, but I'd be curious to hear if anyone else experiences this behavior. Though I switched to a different problem, ASN gave really compelling results after letting the single-GPU jobs run.

lychrel avatar Dec 30 '20 07:12 lychrel

@lychrel I don't really understand your answer. I don't know what ASN is. Could you explain it more simply?

Lufei-github avatar Dec 30 '20 07:12 Lufei-github