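Training EPMF on a single GPU: loss and all other values become NaN at around epoch 10
In the same environment, training the PMF model works fine, but when training EPMF all values become NaN. I have repeated the training several times and it happens every time. I tried saving the most recent checkpoint, but when I load the checkpoint and resume training, the learning rate does not continue from the previous value of 0.00095 and instead starts again from 0.00001. I am not sure whether this affects the final training result.
This is a screenshot of the learning rate mismatch when I resumed training from epoch 9.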
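For anyone hitting the same learning rate reset: one possible explanation is that the scheduler's step counter is not restored on resume, so the schedule starts again from its warmup value. The sketch below illustrates that effect in plain Python. The `WarmupCosineSchedule` class, its parameters, and the step counts are hypothetical and are not taken from the EPMF code; the thread does not say which scheduler the repository uses. The `state_dict()` / `load_state_dict()` method names follow the convention PyTorch optimizers and schedulers expose.

```python
import math


class WarmupCosineSchedule:
    """Hypothetical schedule: linear warmup followed by cosine decay."""

    def __init__(self, base_lr, warmup_steps, total_steps, min_lr=1e-5):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.min_lr = min_lr
        self.step_count = 0  # the state that has to survive a restart

    def lr(self):
        # Linear warmup from min_lr up to base_lr.
        if self.step_count < self.warmup_steps:
            frac = self.step_count / self.warmup_steps
            return self.min_lr + frac * (self.base_lr - self.min_lr)
        # Cosine decay from base_lr back down to min_lr.
        progress = (self.step_count - self.warmup_steps) / (
            self.total_steps - self.warmup_steps
        )
        progress = min(progress, 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return self.min_lr + (self.base_lr - self.min_lr) * cosine

    def step(self):
        self.step_count += 1

    def state_dict(self):
        return {"step_count": self.step_count}

    def load_state_dict(self, state):
        self.step_count = state["step_count"]


def make_schedule():
    # Illustrative values only (assumption, not the EPMF config).
    return WarmupCosineSchedule(base_lr=1e-3, warmup_steps=100, total_steps=1000)


# Train for a while, then save the scheduler state next to the model weights.
original = make_schedule()
for _ in range(300):
    original.step()
checkpoint = {"scheduler": original.state_dict()}

# Resume A: a fresh scheduler whose state is never restored.
restarted = make_schedule()

# Resume B: a fresh scheduler that loads the saved state.
resumed = make_schedule()
resumed.load_state_dict(checkpoint["scheduler"])

print(f"before interruption:   {original.lr():.6f}")
print(f"restart without state: {restarted.lr():.6f}")
print(f"resume with state:     {resumed.lr():.6f}")
```

If the resume path in the training script restores only the model weights, saving and loading the optimizer and scheduler state as well would be the first thing to check.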
Hello, may I ask what model of graphics card you used for training?
I encountered this issue when training with a single RTX 4090. Subsequent training with A100-40GB * 4 or vGPU-32GB * 8 on the KITTI dataset did not produce this problem. However, on the nuScenes dataset, the loss still becomes NaN at a certain epoch.
Thank you very much for your reply. It was very helpful to me. I have another question now: How long did it take you to train KITTI using vGPU32GB? If I use four 3090s to train KITTI, approximately how long will it take? I look forward to your reply.
In reality, I only used A100-40GB * 4 to train the model on the KITTI dataset, which took about 70 hours. Training on a single RTX 4090 was expected to take even less time, but it hit a NaN loss midway through training, so I don't know the actual training time. On the nuScenes dataset, training with A100-40GB * 4 was estimated to take 120 days, so I abandoned that approach. Subsequent attempts using vGPU 32GB * 8 took approximately 8 days, but they also hit a NaN loss midway through training.
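Since the NaN loss shows up midway through long runs, a guard around the update step can at least keep one bad batch from poisoning the weights. The following is a minimal plain-Python sketch of that idea: skip the update when the loss or any gradient is non-finite, and clip the gradient norm otherwise. The function names, the SGD-style update, and the `max_grad_norm` default are assumptions for illustration, not the EPMF training loop, and this is a generic mitigation rather than a diagnosis of the cause, which the thread does not identify.

```python
import math


def all_finite(values):
    """Return True if every value is neither NaN nor infinite."""
    return all(math.isfinite(v) for v in values)


def guarded_update(params, grads, loss, lr, max_grad_norm=1.0):
    """Apply one SGD-style step, or skip it if the loss or gradients are non-finite.

    Returns (new_params, applied), where applied is False when the step was skipped.
    """
    if not math.isfinite(loss) or not all_finite(grads):
        # Leave the parameters unchanged so a single bad batch cannot corrupt them.
        return list(params), False

    # Rescale the gradients so their L2 norm is at most max_grad_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0

    new_params = [p - lr * scale * g for p, g in zip(params, grads)]
    return new_params, True


params = [1.0, 2.0]

# Normal batch: gradient norm is 5.0, so it is scaled down to norm 1.0 before the step.
params, applied = guarded_update(params, [3.0, 4.0], loss=0.7, lr=0.1)
print(applied, params)

# Bad batch: the loss is NaN, so the step is skipped.
params, applied = guarded_update(params, [3.0, 4.0], loss=float("nan"), lr=0.1)
print(applied, params)
```

Logging which batch triggered the skip would also help narrow down whether the NaN comes from specific samples or from the optimization itself.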
Thank you for your explanation. Could you please answer one last question? The author used the SemanticKITTI-FOV dataset. Does this SemanticKITTI-FOV need to be preprocessed by us? I don't think I found the script for processing SemanticKITTI-FOV in the project.
Sorry to bother you. I found that script file. Thank you again for your help.
Hello, I'm sorry to bother you again. I trained the EPMF model on KITTI, but the final mIoU is only 63.02. Have you encountered this issue?