SoftTeacher

Loss becomes NaN mid-training

Open · baigaoyi opened this issue 4 years ago • 6 comments

Hello! I built my own dataset, half sup and half unsup, and trained with bash tools/dist_train.sh {config} {gpu_num}, using the 1080k config. I interrupted training once at iteration 60000; to resume it later, I modified the dist_train.sh script and appended "--resume-from iter_60000.pth" to the train.py command line. From the log, training was normal from iteration 60000 to iteration 444500, but after that the loss became nan, and during evaluation I got the error "mmdet.ssod - ERROR - The testing results of the whole dataset is empty". I checked my test dataset and it is not empty. What could be causing this? Screenshots of the loss=nan log and the empty-dataset error: [screenshots: loss_nan, test_dataset_empty]

baigaoyi · Nov 01 '21

By the time the loss becomes NaN the network has already collapsed, which is why the test results are empty afterwards. First check whether the overall loss curve looks normal. If it is fine early on, the problem may be fp16: try switching to fp32 and see whether it still happens. (If you are using torch<=1.6, it is best to upgrade to a newer version, because with torch<=1.6 fp16 relies on mmdetection's own implementation and I have run into NaN with it before; if you cannot upgrade, I can give you an apex-based version.) Beyond that, try debugging: resume from a checkpoint near the crash and save some key information right before it blows up, then check which values look abnormal.

MendelXu · Nov 01 '21
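For readers hitting the same issue, here is a minimal, hypothetical sketch of the two suggestions above for an mmcv/mmdetection-style setup like SoftTeacher's. Switching to fp32 simply means removing or commenting out the fp16 line in the config; the hook below (the name NanDebugHook and the dump filename are made up for illustration, not part of the repo) logs and saves state as soon as any loss becomes non-finite, so you can inspect the model near the collapse.

```python
# Sketch only (mmcv 1.x Hook API). Switching to fp32 = comment out this config line:
#     fp16 = dict(loss_scale="dynamic")
# The hook below is a hypothetical debugging aid; it could be registered in the
# config with:
#     custom_hooks = [dict(type="NanDebugHook")]
import math

import torch
from mmcv.runner import HOOKS, Hook


@HOOKS.register_module()
class NanDebugHook(Hook):
    """Stop training and save the model as soon as any loss becomes NaN/Inf."""

    def after_train_iter(self, runner):
        log_vars = runner.outputs.get("log_vars", {})
        bad = {k: v for k, v in log_vars.items()
               if isinstance(v, (int, float)) and not math.isfinite(v)}
        if bad:
            runner.logger.error("Non-finite losses at iter %d: %s", runner.iter, bad)
            # Keep a snapshot from the moment of the collapse for offline inspection.
            torch.save(runner.model.state_dict(), f"nan_debug_iter_{runner.iter}.pth")
            raise RuntimeError("NaN/Inf loss detected; stopping for debugging.")
```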

[wandb loss curve screenshot] Looking at the wandb visualization, the loss curve is indeed abnormal, and I am on torch 1.10.0. Unfortunately the checkpoint from just before the collapse was missed; only iter_448000.pth was saved, which is frustrating... it looks like I have to train from scratch. I have completed training with the same configuration before; the only difference is that the training set is now twice as large, and a larger dataset by itself should not make the network collapse. I will try manually switching fp16 to fp32 and retraining from scratch to see whether the problem goes away. Thanks a lot for the reply!

baigaoyi · Nov 01 '21

I had a similar problem with the loss: I got nan and the same empty-dataset message too. The solution for me was using a bigger batch size and changing workers_per_gpu, as in this issue: #9

DaniloAlves1995 · Nov 04 '21
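For anyone unsure which knob is actually the batch size: in mmdetection-style configs like the ones used here, samples_per_gpu is the per-GPU batch size, while workers_per_gpu only sets the number of dataloader worker processes. A minimal sketch of where these live in the config (the values are just illustrative):

```python
# Illustrative values only; the effective batch size is samples_per_gpu * num_GPUs.
data = dict(
    samples_per_gpu=8,   # per-GPU batch size (this is what actually grows the batch)
    workers_per_gpu=2,   # dataloader workers per GPU (affects I/O, not batch size)
    # train=..., val=..., test=... stay as in the original config
)
```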

I faced the same problem. I changed the batch size from 2 to 8 on this line: samples_per_gpu=8, workers_per_gpu=2,

and deleted/commented out the fp16 line as below: #fp16 = dict(loss_scale="dynamic")

but unfortunately I got this error: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 725186) of binary: /opt/conda/bin/python

Can anyone help me? I really don't know what exactly the problem is.

alaa-shubbak · Jul 05 '22

When changing the worker info to

samples_per_gpu=5, workers_per_gpu=2,

and keeping fp16 enabled, I got another type of error: ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed. I am still confused about what exactly the problem is.

I ran this training with 2 GPUs: bash tools/dist_train_partially_faster_rcnn50_att_90_4classes.sh semi 1 10 2

alaa-shubbak · Jul 05 '22

(quoting alaa-shubbak's comment above)

Have you solved it?

sqdcbbj · Dec 19 '22