RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp:663
PyTorch version: 0.4.1
Thanks for your interest in our work.
Could you provide more details about the error so that I can help you? We use PyTorch 0.4.1 on our server and did not encounter this issue.
Setting torch.backends.cudnn.benchmark = False worked for me.
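For reference, the workaround above is a one-line change made before the model is built (a minimal sketch; exactly where it belongs in this repo's training script is an assumption):

```python
import torch

# Disable cuDNN's algorithm autotuning before constructing the model.
# With benchmarking on, cuDNN probes several convolution algorithms, and on
# some driver/GPU combinations a probed call can fail with "invalid argument".
torch.backends.cudnn.benchmark = False
```

cuDNN itself stays enabled; only the autotuner is turned off, usually at a small speed cost.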
The loss becomes NaN during training... can you help me?
Same issue here: the loss becomes NaN during the first epoch.
"Input contains NaN, infinity, or a value too large for dtype('float64')". The loss becomes NaN in the first epoch, with the error message above. Can you help me? Thanks.
Sorry for the late response.
I didn't see the same issue during my experiments. Could you provide more information, such as your environment and a more detailed error message?
My environment is PyTorch 0.4.1, 1080Ti GPU, Python 3.7.4 and CUDA 9.2.148. The error may also come from an incorrectly installed PyTorch extension. You could try training a supervised RSCNN with their code (https://github.com/Yochengliu/Relation-Shape-CNN) to check whether your environment is correctly configured, since PointGLR and RSCNN use similar environments.
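As a quick sanity check (a hypothetical helper, not part of the repo), the versions mentioned above can be printed to spot mismatches:

```python
import sys

import torch

# Print the environment details the maintainer asked about.
print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)              # None for CPU-only builds
print("cudnn :", torch.backends.cudnn.version())  # None if cuDNN is absent
print("gpu   :", torch.cuda.is_available())
```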
Thank you for your help! My environment is PyTorch 0.4.1, 2080Ti GPU, Python 3.6.13 and CUDA 11.4. I've been running RSCNN in this environment until now, because my other work is based on it.
The full training log is as follows:
[epoch 0: 0/615] metric/chamfer/normal loss: 6.763555/0.549132/1.000000 lr: 0.00143
[epoch 0: 20/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 40/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 60/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 80/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 100/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 120/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 140/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 160/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 180/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 200/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 220/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 240/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 260/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 280/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 300/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 320/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 340/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 360/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 380/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 400/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 420/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 440/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 460/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 480/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 500/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 520/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 540/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 560/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 580/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
[epoch 0: 600/615] metric/chamfer/normal loss: nan/nan/nan lr: 0.00143
Traceback (most recent call last):
File "/home/xgq/文档/PointGLR-master8.13/train.py", line 322, in
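One standard way to localize where the NaN first appears (general PyTorch tooling, not specific to this repo) is to enable anomaly detection and check tensors explicitly; a toy illustration:

```python
import torch

# Anomaly mode makes backward() raise at the first op whose gradient is NaN,
# with a stack trace pointing to the forward line that created it.
torch.autograd.set_detect_anomaly(True)

# Toy example of a forward-pass NaN: sqrt of a negative number.
x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)
print(torch.isnan(y).item())  # True: the NaN already exists in the forward pass
```

Anomaly mode is slow, so it is best enabled only while hunting the bug.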
I also hit the NaN issue with this repo months ago. I simply switched to another machine with a different GPU (in my case, a TITAN Xp failed and a TITAN V worked) and everything ran fine, but I'm still not sure whether hardware caused the issue.
I just tried increasing the batch size, and it seems to help. With batch size 22, only the first log line is normal; as the batch size increases, more and more lines are normal before the loss turns to NaN. But my GPU can only fit a batch size of 64, and training still doesn't run normally.
If the error is related to the batch size, the NaN may come from the contrastive learning loss, which is usually less stable than a supervised loss. You could try a smaller learning rate, replace the Normalize method in this line with torch.nn.functional.normalize(input, p=2, dim=1, eps=1e-12) using a larger eps, or use a smaller scale such as s = 64 to avoid float overflow. Since everything works well in my environment, I am not sure whether these tricks will help you.
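A minimal sketch of the normalization suggestion above, where feat stands in for the embedding that gets L2-normalized before the contrastive loss (the variable name is an assumption):

```python
import torch
import torch.nn.functional as F

feat = torch.zeros(4, 128)  # degenerate case: all-zero feature vectors

# A larger eps keeps the division stable when a vector's norm is near zero;
# F.normalize defaults to eps=1e-12.
out = F.normalize(feat, p=2, dim=1, eps=1e-4)

print(torch.isnan(out).any().item())  # False: no NaN even for zero vectors
```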
I tried these methods, but they didn't work. Maybe I should try another device.