Training with the recommended script, I get "Loss is nan, stopping training" at around 30k steps. Has anyone else run into this?
Loss is nan, stopping training {'pt_loss': tensor(nan, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'poly_loss': tensor(nan, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'rec_loss': tensor(3.2648, device='cuda:0', grad_fn=<NllLoss2DBackward0>)}
Are you running the Omni training? I hit "loss is nan" during the second-stage pretraining and don't know how to fix it yet.
My stage1 went NaN right from the start: Loss is nan, stopping training {'poly_loss': tensor(11.0486, device='cuda:0'), 'pt_loss': tensor(11.0483, device='cuda:0'), 'rec_loss': tensor(nan, device='cuda:0')}
+1, hit the same thing in Omni stage1: Epoch: [5] [19770/26922] eta: 0:36:55 lr: 0.000311 loss: 12.8984 (12.8393) pt_loss: 6.1091 (6.0516) poly_loss: 3.1889 (3.1327) rec_loss: 3.4215 (3.6550) time: 0.3092 data: 0.0097 max mem: 17167 Loss is nan, stopping training {'pt_loss': tensor(5.8353, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'poly_loss': tensor(nan, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'rec_loss': tensor(1.7320, device='cuda:0', grad_fn=<NllLoss2DBackward0>)}
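Across the reports above, a different term goes NaN each time (pt_loss and poly_loss together, rec_loss alone, poly_loss alone). Not a fix, but a minimal sketch for narrowing it down: a helper like the one below could be called on the loss dict every step to log which term becomes non-finite first. The name `non_finite_terms` is hypothetical and not part of the training code, and plain floats stand in for the tensors shown in the logs.

```python
import math


def non_finite_terms(loss_dict):
    """Return the names of loss terms whose value is NaN or +/-inf.

    Values may be plain numbers or anything float() accepts
    (a 0-dim tensor is assumed to convert via float()).
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Values copied from the stage1 log above; floats replace the tensors.
losses = {"pt_loss": 5.8353, "poly_loss": float("nan"), "rec_loss": 1.7320}

bad = non_finite_terms(losses)
if bad:
    print("non-finite loss terms:", bad)
```

Logging the step number and batch identifiers next to this output would show whether the same samples trigger the NaN on every run or whether it drifts, which helps separate a data problem from a numerical one.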