Nathan Yan
Problem solved for now! In case others encounter a similar issue: if you use a single node with multiple GPUs, replace the DDP with the following; there is a hacky way,...
Hi @wpeebles, thanks for your reply! Yeah, I agree that using torch.distributed is always a better choice. Yes, it seems the problem has somehow come back again now -- it...
This looks really awesome! @yukang2017 I was doing something similar, but when I resume from a checkpoint, it seems the loss is not exactly going down from the point where...
@achen46 I also have a similar question about whether this is due to the EMA weights, but I thought the EMA weights had been stored, right?