
When training the EPMF code on a single GPU, the loss and all other values become NaN at around epoch 10

Open IKAROS66 opened this issue 1 year ago • 8 comments

IKAROS66 avatar Jan 15 '25 14:01 IKAROS66

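In the same environment, training the PMF model works without problems, but training EPMF ends with every value becoming NaN. I repeated the training several times and it happened each time. I tried saving the most recent checkpoint, but when I load the checkpoint and resume training, the learning rate does not continue from the previous 0.00095; it restarts from 0.00001 instead. I am not sure whether this affects the final training result. [image] This is a screenshot of the learning-rate mismatch when I resumed training from epoch 9. [image]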
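One possible explanation for the learning-rate jump (an assumption, not confirmed anywhere in this thread) is that the checkpoint restores the weights but not the scheduler's step counter, so a warmup schedule restarts from its initial value. The sketch below illustrates that effect in plain Python. The schedule shape, `lr_at_step`, `save_checkpoint`, `resume_step`, and all numeric defaults are hypothetical and are not taken from the EPMF code.

```python
import math


def lr_at_step(step, base_lr=1e-3, warmup_start_lr=1e-5,
               warmup_steps=100, total_steps=1000):
    """Hypothetical linear-warmup + cosine-decay schedule (NOT EPMF's actual one)."""
    if step < warmup_steps:
        # Linear ramp from warmup_start_lr up to base_lr.
        frac = step / warmup_steps
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    # Cosine decay from base_lr towards 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def save_checkpoint(step, weights):
    """Store the schedule position together with the weights (hypothetical helper)."""
    return {"step": step, "weights": dict(weights)}


def resume_step(checkpoint, restore_schedule=True):
    """Return the step to resume from; 0 models a loader that ignores the saved step."""
    return checkpoint["step"] if restore_schedule else 0


ckpt = save_checkpoint(step=300, weights={"w": 0.0})

lr_before = lr_at_step(300)                                # LR when the checkpoint was written
lr_resumed_ok = lr_at_step(resume_step(ckpt, True))        # schedule position restored
lr_resumed_reset = lr_at_step(resume_step(ckpt, False))    # schedule position lost

print(lr_before, lr_resumed_ok, lr_resumed_reset)
```

If the schedule position is restored, the resumed learning rate matches the value at save time; if it is dropped, the run restarts at the warmup value. Whether this is what happens in EPMF would need to be checked against the project's own checkpoint loading code.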

IKAROS66 avatar Jan 15 '25 14:01 IKAROS66

Hello, may I ask what model of graphics card you used for training?

zuozuoguai1 avatar Nov 27 '25 07:11 zuozuoguai1

> Hello, may I ask what model of graphics card you used for training?

I encountered this issue when training with a single RTX 4090. Subsequent training with A100-40GB * 4 or vGPU-32GB * 8 on the KITTI dataset did not produce this problem. However, on the nuScenes dataset, the loss still becomes NaN at a certain epoch.

IKAROS66 avatar Nov 27 '25 07:11 IKAROS66

Thank you very much for your reply. It was very helpful to me. I have another question now: How long did it take you to train KITTI using vGPU32GB? If I use four 3090s to train KITTI, approximately how long will it take? I look forward to your reply.

zuozuoguai1 avatar Nov 27 '25 07:11 zuozuoguai1

> Thank you very much for your reply. It was very helpful to me. I have another question now: How long did it take you to train KITTI using vGPU32GB? If I use four 3090s to train KITTI, approximately how long will it take? I look forward to your reply.

In fact, I only used A100-40GB * 4 to train the model on the KITTI dataset, which took about 70 hours. A single RTX 4090 was expected to take even less time, but that run hit a NaN loss midway through training, so I don't know the actual training time. On the nuScenes dataset, training with A100-40GB * 4 was estimated to take 120 days, so I abandoned that approach. A subsequent attempt using vGPU 32GB * 8 took approximately 8 days, but it also hit a NaN loss midway through training.

IKAROS66 avatar Nov 27 '25 07:11 IKAROS66

Thank you for your explanation. Could you please answer one last question? The author used the SemanticKITTI-FOV dataset. Does this SemanticKITTI-FOV need to be preprocessed by us? I don't think I found the script for processing SemanticKITTI-FOV in the project.

zuozuoguai1 avatar Nov 27 '25 08:11 zuozuoguai1

Sorry to bother you. I found that script file. Thank you again for your help.

zuozuoguai1 avatar Nov 27 '25 08:11 zuozuoguai1

Hello, I'm sorry to bother you again. I trained the EPMF model using KITTI, but the final mIoU is only 63.02. I wanted to ask if you have encountered this issue?

zuozuoguai1 avatar Nov 29 '25 02:11 zuozuoguai1