
Issue at training time

Open ji-min-song opened this issue 3 years ago • 2 comments

When I train, the loss becomes NaN:

python train.py --model=lmo

step: 0    total_loss: 9.5576973 obj_cls: 2.77258897 frag_cls: 4.15888262 frag_loc: 2.37503433
step: 100  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.1272
step: 200  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22972
step: 300  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22798
step: 400  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.52651024
INFO:tensorflow:global_step/sec: 2.22882
step: 500  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.23132
step: 600  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.38968158
INFO:tensorflow:global_step/sec: 2.22965
step: 700  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.23278
step: 800  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22892
step: 900  total_loss: nan obj_cls: nan frag_cls: nan frag_loc: 2.19273663
INFO:tensorflow:global_step/sec: 2.22798
step: 1000 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22868
step: 1100 total_loss: nan obj_cls: nan frag_cls: nan frag_loc: nan
INFO:tensorflow:global_step/sec: 2.22729
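A minimal way to narrow down which term diverges first (a sketch assuming the TF 1.x API this repo uses; the function below and its argument names are placeholders, not the actual tensor names in train.py) is to wrap each loss term with tf.check_numerics, which raises an InvalidArgumentError at the first NaN/Inf instead of letting it propagate into the summaries:

import tensorflow as tf

def guard_loss_terms(obj_cls_loss, frag_cls_loss, frag_loc_loss):
  # tf.check_numerics fails as soon as the wrapped tensor contains NaN or
  # Inf, so the error message names the first loss term that diverges
  # rather than a downstream summary op.
  obj_cls_loss = tf.check_numerics(obj_cls_loss, 'obj_cls loss is NaN/Inf')
  frag_cls_loss = tf.check_numerics(frag_cls_loss, 'frag_cls loss is NaN/Inf')
  frag_loc_loss = tf.check_numerics(frag_loc_loss, 'frag_loc loss is NaN/Inf')
  return obj_cls_loss + frag_cls_loss + frag_loc_loss

Using this in place of the plain sum of the losses surfaces the offending term in the error message.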

I think these NaNs are what trigger the following error:

Caused by op 'logits/pred_frag_conf/weights_1', defined at:
  File "train.py", line 559, in <module>
    tf.app.run()
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 485, in main
    freeze_regex_list=FLAGS.freeze_regex_list)
  File "train.py", line 355, in _train_epos_model
    reuse_variable=(i != 0))
  File "train.py", line 267, in _tower_loss
    outputs_to_num_channels)
  File "train.py", line 239, in _build_epos_model
    tf.summary.histogram(model_var.op.name, model_var)
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 187, in histogram
    tag=tag, values=values, name=scope)
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/default/anaconda3/envs/epos/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: logits/pred_frag_conf/weights_1
  [[node logits/pred_frag_conf/weights_1 (defined at train.py:239) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](logits/pred_frag_conf/weights_1/tag, logits/pred_frag_conf/weights/read/_9035)]]
  [[{{node xception_65/middle_flow/block1/unit_3/xception_module/separable_conv2_depthwise/BatchNorm/moving_mean/read/_9950}} = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1856_..._mean/read", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]

What can I do to get training to work?
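The HistogramSummary op is only where the NaNs are first detected; by step 100 the weights themselves already contain NaN, so disabling the tf.summary.histogram call at train.py:239 would only hide the symptom. A minimal TF 1.x sketch for locating the op that first produces a NaN, assuming the usual session-based training loop (sess and train_op below are placeholders for whatever train.py actually uses):

import tensorflow as tf

# tf.add_check_numerics_ops() walks the current graph and attaches a
# CheckNumerics op to every floating-point tensor. Running the returned op
# alongside the train op makes the session fail at the exact op that first
# yields NaN/Inf, instead of failing later at a summary histogram.
check_numerics_op = tf.add_check_numerics_ops()

# Inside the training loop (placeholders, not the repo's actual names):
#   sess.run([train_op, check_numerics_op])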

ji-min-song · May 24 '22 04:05

InvalidArgumentError (see above for traceback): Nan in summary histogram for: xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta_1
  [[node xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta_1 (defined at train.py:239) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta_1/tag, xception_65/middle_flow/block1/unit_1/xception_module/separable_conv3_depthwise/BatchNorm/beta/read/_8917)]]

"Nan in summary histogram for" occurs various part of model

ji-min-song · May 24 '22 06:05

Hello ji-min-song, may I ask whether you fixed the error? I got the same one...

cxym-yyn · Jul 04 '22 05:07