Nathan Yan comments


Results: 4 comments by Nathan Yan

training gets frozen while using multiple-GPUs

Problem solved for now! In case others encounter a similar issue: if you use a single node with multiple GPUs, replace the DDP with the following, there is a hacky way,...

training gets frozen while using multiple-GPUs

Hi @wpeebles, thanks for your reply! Yeah, I agree that using torch.distributed is always a better choice. Yes, it seems the problem has somehow come back again now -- it...

Resume training from a checkpoint

This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, it seems the loss does not continue going down from the point where...

Resume training from a checkpoint

@achen46 I also have a similar question about whether this is due to the EMA weights, but I thought the EMA weights were stored, right?