HigherHRNet-Human-Pose-Estimation
Loss is NaN when using Mixed-precision training
Hi, I'm trying to reproduce your results but have run into a problem. The configuration file I'm using is w48_640_adam_lr1e-3.yaml, which uses mixed-precision training. During training, many of the logged losses are NaN, for example:
2019-12-12 15:32:31,243 Epoch: [40][1200/1603] Time: 1.162s (1.383s) Speed: 17.2 samples/s Data: 0.000s (0.191s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:34:47,608 Epoch: [40][1300/1603] Time: 1.177s (1.381s) Speed: 17.0 samples/s Data: 0.000s (0.189s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:37:04,760 Epoch: [40][1400/1603] Time: 1.322s (1.381s) Speed: 15.1 samples/s Data: 0.000s (0.188s) Stage0-heatmaps: 9.458e-03 (nan) Stage1-heatmaps: 4.662e-02 (nan) Stage0-push: 1.692e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 1.046e-05 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:39:20,082 Epoch: [40][1500/1603] Time: 1.175s (1.379s) Speed: 17.0 samples/s Data: 0.000s (0.186s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:41:36,051 Epoch: [40][1600/1603] Time: 1.170s (1.378s) Speed: 17.1 samples/s Data: 0.000s (0.185s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:41:59,233 => saving checkpoint to output/coco_kpt/pose_higher_hrnet/w48_640_adam_lr1e-3
2019-12-12 15:42:35,559 Epoch: [41][0/1603] Time: 33.373s (33.373s) Speed: 0.6 samples/s Data: 32.120s (32.120s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:44:52,633 Epoch: [41][100/1603] Time: 1.175s (1.688s) Speed: 17.0 samples/s Data: 0.000s (0.481s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:47:09,428 Epoch: [41][200/1603] Time: 1.163s (1.529s) Speed: 17.2 samples/s Data: 0.000s (0.323s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:49:27,329 Epoch: [41][300/1603] Time: 1.268s (1.479s) Speed: 15.8 samples/s Data: 0.000s (0.272s) Stage0-heatmaps: 1.048e-02 (nan) Stage1-heatmaps: 5.480e-02 (nan) Stage0-push: 1.723e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 1.051e-05 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:51:43,614 Epoch: [41][400/1603] Time: 1.177s (1.450s) Speed: 17.0 samples/s Data: 0.000s (0.247s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:54:01,107 Epoch: [41][500/1603] Time: 1.171s (1.435s) Speed: 17.1 samples/s Data: 0.000s (0.231s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:56:19,575 Epoch: [41][600/1603] Time: 1.196s (1.427s) Speed: 16.7 samples/s Data: 0.000s (0.221s) Stage0-heatmaps: nan (nan) Stage1-heatmaps: nan (nan) Stage0-push: nan (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: nan (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 15:58:39,054 Epoch: [41][700/1603] Time: 1.210s (1.422s) Speed: 16.5 samples/s Data: 0.000s (0.215s) Stage0-heatmaps: 7.745e-03 (nan) Stage1-heatmaps: 5.000e-02 (nan) Stage0-push: 2.685e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 1.448e-05 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 16:00:57,280 Epoch: [41][800/1603] Time: 1.210s (1.417s) Speed: 16.5 samples/s Data: 0.000s (0.209s) Stage0-heatmaps: 8.945e-03 (nan) Stage1-heatmaps: 5.417e-02 (nan) Stage0-push: 2.414e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 7.625e-06 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 16:03:16,186 Epoch: [41][900/1603] Time: 1.342s (1.414s) Speed: 14.9 samples/s Data: 0.000s (0.204s) Stage0-heatmaps: 8.899e-03 (nan) Stage1-heatmaps: 5.211e-02 (nan) Stage0-push: 2.192e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 1.391e-05 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 16:05:34,076 Epoch: [41][1000/1603] Time: 1.350s (1.410s) Speed: 14.8 samples/s Data: 0.000s (0.200s) Stage0-heatmaps: 9.861e-03 (nan) Stage1-heatmaps: 5.311e-02 (nan) Stage0-push: 2.468e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 1.238e-05 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
2019-12-12 16:07:52,383 Epoch: [41][1100/1603] Time: 1.368s (1.408s) Speed: 14.6 samples/s Data: 0.000s (0.197s) Stage0-heatmaps: 1.027e-02 (nan) Stage1-heatmaps: 5.140e-02 (nan) Stage0-push: 2.198e-04 (nan) Stage1-push: 0.000e+00 (0.000e+00) Stage0-pull: 1.910e-05 (nan) Stage1-pull: 0.000e+00 (0.000e+00)
I'm training on 2 NVIDIA Tesla V100 GPUs, so I changed IMAGES_PER_GPU from 10 to 20. All other configuration and code remain the same. I can't figure out what is causing so many NaNs. Any help would be appreciated, thanks! @bowenc0221 @leoxiaobin
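A note on reading the log above: some iterations report a finite current value but a running average of nan, e.g. `Stage0-heatmaps: 9.458e-03 (nan)`. That is expected for a plain running mean, because a single NaN sample makes the average NaN for the rest of the epoch, so the value in parentheses says nothing about how many iterations were actually bad. Below is a minimal sketch of a meter that skips non-finite samples and counts them instead. The class `FiniteAverageMeter` and the helper `naive_mean` are hypothetical names for illustration, not part of this repository, and this only helps with diagnosing the logs; it does not address whatever produces the NaNs.

```python
import math


class FiniteAverageMeter:
    """Running mean that ignores non-finite samples and counts them.

    Hypothetical helper for illustration; not taken from the repository.
    """

    def __init__(self):
        self.total = 0.0    # sum of finite samples seen so far
        self.count = 0      # number of finite samples
        self.skipped = 0    # number of NaN/inf samples that were ignored

    def update(self, value, n=1):
        value = float(value)
        if not math.isfinite(value):
            # Record the bad sample but keep it out of the average.
            self.skipped += n
            return
        self.total += value * n
        self.count += n

    @property
    def avg(self):
        # NaN only if no finite sample has been seen yet.
        return self.total / self.count if self.count else float("nan")


def naive_mean(values):
    """Plain mean: one NaN in `values` makes the result NaN."""
    return sum(values) / len(values)


# Two finite losses copied from the log, with one NaN iteration in between.
losses = [9.458e-03, float("nan"), 1.048e-02]

meter = FiniteAverageMeter()
for loss in losses:
    meter.update(loss)

print("naive mean is NaN:", math.isnan(naive_mean(losses)))
print("finite-only mean:", meter.avg, "skipped:", meter.skipped)
```

Logging `skipped` next to the average would show what fraction of iterations went non-finite, which makes it easier to tell an occasional bad batch from a run that has diverged.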
Same issue here. Any suggestions would be appreciated!
I have the same problem. Does anyone have a suggestion?