NaN

Open dan326326 opened this issue 4 years ago • 13 comments

Thank you again for your answer. Now there is a new problem: the loss becomes NaN near the beginning of training. Do you know how to solve this?

Epoch: [1/200] Iter:[280/934], Time: 7.62, lr: 0.0049, Loss: nan
NaN or Inf found in input tensor.

dan326326 avatar Nov 14 '21 08:11 dan326326

Hi, I cannot infer the cause from this one line. You should debug carefully: trace your input images through the pipeline and find which one produces the NaN.
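One way to trace where the NaN first appears is to attach forward hooks to the model, sketched below. This is an illustration only, not code from the CAT-Net repository: the helper name `attach_nonfinite_hooks` and the `report` list are hypothetical, and it assumes a standard PyTorch `nn.Module` whose leaf layers take and return plain tensors.

```python
import torch
import torch.nn as nn


def attach_nonfinite_hooks(model, report):
    """Record (layer_name, "input" | "output") whenever a leaf layer sees NaN/Inf.

    `report` is a plain list that the hooks append to, in forward-pass order.
    Returns the hook handles so the hooks can be removed afterwards.
    """
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, hook only leaf layers

        def hook(mod, inputs, output, name=name):
            # Check the inputs first: this separates "the data was already bad"
            # from "this layer produced the bad values".
            for t in inputs:
                if isinstance(t, torch.Tensor) and not torch.isfinite(t).all():
                    report.append((name, "input"))
                    break
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                report.append((name, "output"))

        handles.append(module.register_forward_hook(hook))
    return handles
```

After one forward pass on a suspicious batch, the first entry in `report` shows whether the batch was already non-finite when it reached the first layer or whether a specific layer produced the non-finite values. Call `handle.remove()` on each returned handle once debugging is done.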

CauchyComplete avatar Nov 14 '21 12:11 CauchyComplete

Thank you for your reply. Is there a uniform size requirement for the images?

dan326326 avatar Nov 14 '21 12:11 dan326326

It is recommended to use images larger than 512x512, but smaller ones are okay as long as they make up only a small portion of the dataset. You don’t need fixed-size images because they will be cropped automatically to 512x512.

CauchyComplete avatar Nov 15 '21 17:11 CauchyComplete

One possible reason is that some of your images might actually be non-JPEG files with a .jpg extension, or some images might simply be corrupted. Modify your code to print the filename when an error occurs, and remove that image from the training set.
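A minimal sketch of such a check, using only the Python standard library: it walks a dataset directory and lists files that carry a .jpg/.jpeg extension but do not start with the JPEG start-of-image bytes. The function name `find_suspect_jpegs` and the directory layout are assumptions for illustration and are not part of CAT-Net.

```python
import os

# JPEG files begin with the SOI marker (FF D8) followed by the FF of the next marker.
JPEG_SOI = b"\xff\xd8\xff"


def find_suspect_jpegs(root):
    """Return paths under `root` named *.jpg/*.jpeg whose header is not JPEG."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith((".jpg", ".jpeg")):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    header = f.read(3)
            except OSError:
                # Unreadable files are reported too, so they can be inspected.
                suspects.append(path)
                continue
            if header != JPEG_SOI:
                suspects.append(path)
    return suspects
```

This only catches files with the wrong format or an empty body. A file that starts as valid JPEG but is truncated or damaged further in would still pass, so printing the filename inside the data loader when decoding fails remains useful.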

CauchyComplete avatar Nov 15 '21 17:11 CauchyComplete

Thank you very much indeed. I located where the NaN is generated: the data becomes NaN after passing through the first convolution layer, conv1. What is going on here?

dan326326 avatar Nov 16 '21 06:11 dan326326

That’s weird…

CauchyComplete avatar Nov 17 '21 15:11 CauchyComplete

You may post the image that is causing errors.

CauchyComplete avatar Nov 18 '21 07:11 CauchyComplete

Hello, thank you for your patient reply. This is the running result:

Epoch: [0/200] Iter:[0/1176], Time: 1100.00, lr: 0.005000, Loss: 0.691021
Epoch: [0/200] Iter:[10/1176], Time: 101.08, lr: 0.005000, Loss: nan
NaN or Inf found in input tensor.
Epoch: [0/200] Iter:[20/1176], Time: 53.51, lr: 0.005000, Loss: nan
NaN or Inf found in input tensor.
...

dan326326 avatar Nov 20 '21 07:11 dan326326

What I found is that the result of the convolution is huge:

tensor([[[[-7.9044e+31,  1.3967e+32, -1.7841e+32, ...,  1.3038e+32, -3.7658e+32,  3.4734e+32],
          [-1.0282e+32, -8.2208e+31, -2.6121e+30, ...,  1.0440e+32,  8.8422e+31,  7.1321e+30],
          [ 3.3423e+31,  1.6554e+32,  4.9708e+31, ..., -2.4600e+32, -5.2235e+32,  2.3235e+32],
          ...,

dan326326 avatar Nov 20 '21 13:11 dan326326

Please upload that image file. I'll test it on my computer.

CauchyComplete avatar Nov 24 '21 05:11 CauchyComplete

[Screenshot 2023-08-16 182710]

Same case here with the CASIA2 dataset. Any solution?

FathUMinUllah3797 avatar Aug 16 '23 17:08 FathUMinUllah3797

Has this problem been resolved? I am also experiencing the same issue. Please help.

TwitchOnly111 avatar Oct 20 '24 09:10 TwitchOnly111