Issue at training time
When I train, I get NaN for the loss:
python train.py --model=lmo
step: 0 total_loss: 9.5576973 obj_cls: 2.77258897 frag_cls: 4.15888262 frag_loc: 2.37503433
step: 100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.1272
step: 200 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22972
step: 300 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22798
step: 400 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.52651024
INFO:tensorflow:global_step/sec: 2.22882
step: 500 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.23132
step: 600 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.38968158
INFO:tensorflow:global_step/sec: 2.22965
step: 700 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.23278
step: 800 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22892
step: 900 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.19273663
INFO:tensorflow:global_step/sec: 2.22798
step: 1000 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22868
step: 1100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22729
I think the NaN loss is what triggers the following error:
Caused by op 'logits/pred_frag_conf/weights_1', defined at:
File "train.py", line 559, in
InvalidArgumentError (see above for traceback): Nan in summary histogram for: logits/pred_frag_conf/weights_1
[[node logits/pred_frag_conf/weights_1 (defined at train.py:239) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](logits/pred_frag_conf/weights_1/tag, logits/pred_frag_conf/weights/read/_9035)]]
[[{{node xception_65/middle_flow/block1/unit_3/xception_module/separable_conv2_depthwise/BatchNorm/moving_mean/read/_9950}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1856_..._mean/read", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
What can I do to get training to work?
InvalidArgumentError (see above for traceback): Nan in summary histogram for: xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta_1
[[node xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta_1 (defined at train.py:239) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta_1/tag, xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta/read/_8917)]]
The "Nan in summary histogram for" error occurs in various parts of the model.
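Since the histogram error shows up for different variables, it may be a symptom rather than the cause: once the loss is NaN, the weights updated from it can become NaN too. One way to narrow this down is to find the first step at which each loss term stops being finite. Below is a rough sketch that scans the printed training log for that. The function name `first_non_finite_steps` is made up for illustration, and the loss names are taken from the log output above; this is not part of train.py.

```python
import math
import re

# Loss names as they appear in the printed log; "step" marks a new record.
FIELD = re.compile(r"\b(step|total_loss|obj_cls|frag_cls|frag_loc):\s*(\S+)")


def first_non_finite_steps(log_text):
    """Return {loss_name: first step at which that value was NaN or inf}."""
    first_bad = {}
    current_step = None
    for name, value in FIELD.findall(log_text):
        if name == "step":
            current_step = int(value)
            continue
        # float() accepts "nan" and "inf" as well as ordinary decimals.
        if not math.isfinite(float(value)) and name not in first_bad:
            first_bad[name] = current_step
    return first_bad


if __name__ == "__main__":
    sample = (
        "step: 0 total_loss: 9.5576973 obj_cls: 2.77258897 "
        "frag_cls: 4.15888262 frag_loc: 2.37503433 "
        "step: 100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan "
        "INFO:tensorflow:global_step/sec: 2.1272"
    )
    print(first_non_finite_steps(sample))
```

With the log above, every term is already NaN by step 100, so the divergence happens somewhere in the first 100 steps; printing the loss every step over that range would show which term goes first. If the training loop runs through a session that accepts hooks (an assumption, I have not checked train.py), TensorFlow 1.x also provides `tf.train.NanTensorHook`, which stops training as soon as the monitored loss becomes NaN.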
Hello ji-min-song, may I ask whether you fixed the error? I am getting the same error.