NaN question

Open lywang76 opened this issue 3 years ago • 2 comments

While pretraining on ImageNet, I got a NaN error at epoch 186: [10:41:36.508034] Loss is nan, stopping training

Can you explain how I should fix this error?

Aug 13 '22 14:08 lywang76

Resume from the nearest checkpoint and switch to FP32 training.

Aug 13 '22 15:08 gaopengpjlab
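
For readers following along, the suggestion above (resume from the nearest checkpoint and continue in FP32) might look roughly like this. It is a minimal sketch, not this repository's actual resume code: the checkpoint keys ("model", "optimizer", "epoch"), the helper names, and the file name checkpoint-185.pth are all assumptions made for illustration, and a tiny linear layer stands in for the real model.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, epoch):
    # Hypothetical checkpoint layout; the real training script may use other keys.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "epoch": epoch,
        },
        path,
    )


def resume_fp32(path, model, optimizer):
    # Load the last checkpoint saved before the loss became NaN.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    # Make sure all parameters are FP32 before training continues.
    model.float()
    # Continue with the epoch after the one that was saved.
    return ckpt["epoch"] + 1


# Stand-in model and optimizer (assumptions, not the real pretraining setup).
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Simulate a checkpoint written at the end of epoch 185.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint-185.pth")
save_checkpoint(ckpt_path, model, optimizer, epoch=185)

start_epoch = resume_fp32(ckpt_path, model, optimizer)
```

Resuming from a checkpoint saved before the NaN appeared matters because once the loss is NaN, the weights and optimizer state written afterwards are usually corrupted as well.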

Delete this line (with torch.cuda.amp.autocast():) to turn off FP16, and unindent the code under it.

Aug 13 '22 16:08 gaopengpjlab
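
As an alternative to deleting the autocast line outright, the context manager can be kept and switched off with its enabled argument, which makes it easy to go back to mixed precision later. The sketch below assumes a hypothetical train_step function and use_amp flag (neither is from this repository) and uses a tiny linear model so it runs on CPU.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, x, y, use_amp):
    # With enabled=False the autocast context does nothing, so the forward
    # pass runs in full FP32, the same as if the "with" line were deleted.
    with torch.autocast(device_type=x.device.type, enabled=use_amp):
        loss = nn.functional.mse_loss(model(x), y)

    # Stop early on a non-finite loss, similar to the "Loss is nan" check
    # reported in the log.
    if not torch.isfinite(loss):
        raise FloatingPointError(f"Loss is {loss.item()}, stopping training")

    optimizer.zero_grad()
    # No gradient scaler is needed when training in FP32.
    loss.backward()
    optimizer.step()
    return loss.item()


# Stand-in model and data (assumptions for illustration only).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randn(8, 2)

loss = train_step(model, optimizer, x, y, use_amp=False)
```

FP32 training uses more memory and is slower than mixed precision, so the batch size may need to be reduced after the switch.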