AdvancedLiterateMachinery

Training with the recommended script fails with "Loss is nan, stopping training" at around 30,000 steps. Has anyone else run into this?

Open JinJiTongXue opened this issue 1 year ago • 3 comments

Loss is nan, stopping training {'pt_loss': tensor(nan, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'poly_loss': tensor(nan, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'rec_loss': tensor(3.2648, device='cuda:0', grad_fn=<NllLoss2DBackward0>)}

JinJiTongXue, Jul 30 '24 05:07

Are you running the Omni training? I ran into "loss is nan" during the stage 2 pre-training, and so far I don't know how to fix it.

lyb18758, Aug 12 '24 09:08

> Are you running the Omni training? I ran into "loss is nan" during the stage 2 pre-training, and so far I don't know how to fix it.

For me, stage 1 went to NaN right from the start:
Loss is nan, stopping training {'poly_loss': tensor(11.0486, device='cuda:0'), 'pt_loss': tensor(11.0483, device='cuda:0'), 'rec_loss': tensor(nan, device='cuda:0')}

madajie9, Jan 16 '25 09:01

+1, I hit this in Omni stage 1:
Epoch: [5] [19770/26922] eta: 0:36:55 lr: 0.000311 loss: 12.8984 (12.8393) pt_loss: 6.1091 (6.0516) poly_loss: 3.1889 (3.1327) rec_loss: 3.4215 (3.6550) time: 0.3092 data: 0.0097 max mem: 17167
Loss is nan, stopping training {'pt_loss': tensor(5.8353, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'poly_loss': tensor(nan, device='cuda:0', grad_fn=<NllLoss2DBackward0>), 'rec_loss': tensor(1.7320, device='cuda:0', grad_fn=<NllLoss2DBackward0>)}

Yang027, May 25 '25 08:05
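
The reports in this thread show different terms going NaN from one run to the next: pt_loss and poly_loss in the first, rec_loss in the third, poly_loss alone in the last. Before trying any fix, it may help to log which terms are non-finite at the failing step. The following is a minimal sketch, not code from the repository: the helper name `nonfinite_loss_terms` is hypothetical, and it assumes the loss dict maps names to plain floats or to scalar tensors exposing `.item()`.

```python
import math
from typing import Any, List, Mapping


def nonfinite_loss_terms(loss_dict: Mapping[str, Any]) -> List[str]:
    """Return the names of loss terms whose value is NaN or infinite.

    Hypothetical helper for illustration; not part of the training script.
    """
    bad = []
    for name, value in loss_dict.items():
        # Scalar tensors expose .item(); plain Python numbers do not.
        scalar = value.item() if hasattr(value, "item") else float(value)
        if not math.isfinite(scalar):
            bad.append(name)
    return bad


# Values copied from the first report in this thread, as plain floats.
report = {"pt_loss": float("nan"), "poly_loss": float("nan"), "rec_loss": 3.2648}
print(nonfinite_loss_terms(report))  # prints ['pt_loss', 'poly_loss']
```

Logging this list together with the step number and learning rate would show whether the same term fails on every run. The thread itself does not establish a cause.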