Training on the COCO dataset
Hi @unsky, have you ever trained FPN on the COCO dataset? When I tried, I ran into some errors, like:

File "../lib/rpn/proposal_layer.py", line 209, in forward
    pad = npr.choice(keep, size=int(post_nms_topN) - len(keep))
  File "mtrand.pyx", line 1121, in mtrand.RandomState.choice
ValueError: a must be non-empty
and:

../lib/fast_rcnn/bbox_transform.py:50: RuntimeWarning: overflow encountered in exp
    pred_h = np.exp(dh) * heights[:, np.newaxis]
Thanks.
When the allowed_border value is small, the error can occur on small images. You can increase allowed_border, or increase the image size.
For the anchor sizes, use the settings from the paper. You can ignore the warning; it only occurs in the first iteration.
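If the overflow warning keeps firing, a common hardening (not something this repo does, as far as this thread shows; the names follow py-faster-rcnn's bbox_transform.py, and the log(1000/16) threshold is an assumption borrowed from Detectron-style configs) is to clamp the predicted deltas before exponentiating:

```python
import numpy as np

# Clamp dw/dh before np.exp so it cannot overflow; log(1000/16) is an
# assumed clip value, not one taken from this repo.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def safe_exp_scale(deltas, sizes):
    """exp(clamped deltas) * sizes, avoiding 'overflow encountered in exp'."""
    return np.exp(np.minimum(deltas, BBOX_XFORM_CLIP)) * sizes

# e.g. in bbox_transform_inv:
#   pred_w = safe_exp_scale(dw, widths[:, np.newaxis])
#   pred_h = safe_exp_scale(dh, heights[:, np.newaxis])
```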
I used a larger allowed_border value, but the error still occurred.
@redrabbit0723 https://github.com/unsky/FPN/blob/2f3e5c39452ac89217a40ab1174b39fa424d71cf/lib/rpn/proposal_layer.py#L213 — add this code.
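For readers who cannot open the link: judging from the discussion, the added lines guard the padding step against the RPN keeping zero proposals. A minimal sketch of that idea (variable names follow py-faster-rcnn's proposal_layer.py; the actual commit may differ):

```python
import numpy as np
import numpy.random as npr

def pad_keep(keep, post_nms_topN):
    """Pad the kept proposal indices to a fixed length post_nms_topN."""
    keep = np.asarray(keep, dtype=np.int64)
    if keep.size == 0:
        # npr.choice raises "ValueError: a must be non-empty" on an empty
        # array, so fall back to a dummy index (index 0).
        keep = np.array([0], dtype=np.int64)
    if keep.size < post_nms_topN:
        pad = npr.choice(keep, size=int(post_nms_topN) - keep.size)
        keep = np.hstack((keep, pad))
    return keep
```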
I can train successfully now, but the training loss is NaN.
I0111 11:27:49.281353 10439 solver.cpp:230] Iteration 100, loss = nan
I0111 11:27:49.281397 10439 solver.cpp:246]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I0111 11:27:49.281404 10439 solver.cpp:246]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I0111 11:27:49.281410 10439 solver.cpp:246]     Train net output #2: RcnnLossBBox = nan (* 1 = nan loss)
I0111 11:27:49.281415 10439 solver.cpp:246]     Train net output #3: RcnnLossCls = 87.3365 (* 1 = 87.3365 loss)
I0111 11:27:49.281421 10439 sgd_solver.cpp:107] Iteration 100, lr = 0.02
I0111 11:28:14.633100 10439 solver.cpp:230] Iteration 120, loss = nan
I0111 11:28:14.633174 10439 solver.cpp:246]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I0111 11:28:14.633184 10439 solver.cpp:246]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I0111 11:28:14.633190 10439 solver.cpp:246]     Train net output #2: RcnnLossBBox = nan (* 1 = nan loss)
I0111 11:28:14.633196 10439 solver.cpp:246]     Train net output #3: RcnnLossCls = 87.3365 (* 1 = 87.3365 loss)
I0111 11:28:14.633203 10439 sgd_solver.cpp:107] Iteration 120, lr = 0.02
I0111 11:28:41.964567 10439 solver.cpp:230] Iteration 140, loss = nan
I0111 11:28:41.964624 10439 solver.cpp:246]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I0111 11:28:41.964634 10439 solver.cpp:246]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I0111 11:28:41.964640 10439 solver.cpp:246]     Train net output #2: RcnnLossBBox = nan (* 1 = nan loss)
I0111 11:28:41.964646 10439 solver.cpp:246]     Train net output #3: RcnnLossCls = 87.3365 (* 1 = 87.3365 loss)
I0111 11:28:41.964653 10439 sgd_solver.cpp:107] Iteration 140, lr = 0.02
I0111 11:29:08.411581 10439 solver.cpp:230] Iteration 160, loss = nan
I0111 11:29:08.411650 10439 solver.cpp:246]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I0111 11:29:08.411666 10439 solver.cpp:246]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I0111 11:29:08.411675 10439 solver.cpp:246]     Train net output #2: RcnnLossBBox = nan (* 1 = nan loss)
I0111 11:29:08.411685 10439 solver.cpp:246]     Train net output #3: RcnnLossCls = 87.3365 (* 1 = 87.3365 loss)
I0111 11:29:08.411716 10439 sgd_solver.cpp:107] Iteration 160, lr = 0.02
..................0.02
It's fine with a learning rate of 0.001. Thanks a lot.
Hi, when I train on pascal_voc:

root@hncs-MS-7817:/home/hncs/liuwei/FPN/experiments/scripts# ./FP_Net_end2end.sh 0 ResNet-50 pascal_voc
++ set -e
++ export PYTHONUNBUFFERED=True
++ PYTHONUNBUFFERED=True
++ GPU_ID=0
++ NET=ResNet-50
++ NET_lc=resnet-50
++ DATASET=pascal_voc
++ array=($@)
++ len=3
++ EXTRA_ARGS=
++ EXTRA_ARGS_SLUG=
++ case $DATASET in
++ TRAIN_IMDB=voc_2007_trainval
++ TEST_IMDB=voc_2007_test
++ PT_DIR=pascal_voc
++ ITERS=250000
+++ date +%Y-%m-%d_%H-%M-%S
++ LOG=experiments/logs/faster_rcnn_end2end_ResNet-50_.txt.2018-01-22_17-00-56
++ exec
+++ tee -a experiments/logs/faster_rcnn_end2end_ResNet-50_.txt.2018-01-22_17-00-56
tee: experiments/logs/faster_rcnn_end2end_ResNet-50_.txt.2018-01-22_17-00-56: No such file or directory
++ echo Logging output to experiments/logs/faster_rcnn_end2end_ResNet-50_.txt.2018-01-22_17-00-56
Logging output to experiments/logs/faster_rcnn_end2end_ResNet-50_.txt.2018-01-22_17-00-56
++ ./tools/train_net.py --gpu 0 --solver models/pascal_voc/ResNet-50/FP_Net_end2end/solver.prototxt --weights data/pretrained_model/ResNet50.v2.caffemodel --imdb voc_2007_trainval --iters 250000 --cfg experiments/cfgs/FP_Net_end2end.yml
./FP_Net_end2end.sh: line 51: ./tools/train_net.py: No such file or directory

But I do have the file.
Thank you very much, @unsky @redrabbit0723.
@unsky "when allow-border value is small , the error will occur in small image.you can increase allow-border. or increase image size." 为什么当allow-border value is small , 就会在小图片中出现这个问题呢?? 麻烦了!
Hi @unsky, I have a doubt about a duplicate of the same condition in proposal_layer.py at lines 208 and 213:

208:  # pad to ensure output size remains unchanged
      if len(keep) < post_nms_topN:

and

213:  # pad to ensure output size remains unchanged
      if len(keep) < post_nms_topN:

Should I disable the first condition, at line 208?
@shohan6 The code at line 213 handles the case where the RPN net proposes zero ROIs.
@shohan6 Yes, they are duplicates of the same condition. You can disable the first one.
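With the first check disabled, only the guarded pad remains; for example, using the hypothetical pad_keep sketch from earlier in this thread:

```python
import numpy as np

print(pad_keep(np.array([3, 7]), 5))              # e.g. [3 7 7 3 7] (random padding)
print(pad_keep(np.array([], dtype=np.int64), 5))  # [0 0 0 0 0] (dummy-index fallback)
```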
Hi @unsky, I tried to train your provided FPN net on the SUN-RGBD dataset, using the RGB channels only, but I always get NaN loss at some point. I am using a learning rate of 0.001.
I1107 18:16:05.126844 16764 solver.cpp:229] Iteration 1520, loss = 1.47688
I1107 18:16:05.126874 16764 solver.cpp:245]     Train net output #0: FPNClsLoss = 0.63313 (* 1 = 0.63313 loss)
I1107 18:16:05.126880 16764 solver.cpp:245]     Train net output #1: FPNLossBBox = 0.108775 (* 1 = 0.108775 loss)
I1107 18:16:05.126884 16764 solver.cpp:245]     Train net output #2: loss_bbox = 0.120012 (* 1 = 0.120012 loss)
I1107 18:16:05.126888 16764 solver.cpp:245]     Train net output #3: loss_cls = 0.773154 (* 1 = 0.773154 loss)
I1107 18:16:05.126893 16764 sgd_solver.cpp:107] Iteration 1520, lr = 0.001
I1107 18:16:31.770622 16764 solver.cpp:229] Iteration 1540, loss = 1.18369e+08
I1107 18:16:31.770653 16764 solver.cpp:245]     Train net output #0: FPNClsLoss = 30.3631 (* 1 = 30.3631 loss)
I1107 18:16:31.770659 16764 solver.cpp:245]     Train net output #1: FPNLossBBox = 6.17318e+07 (* 1 = 6.17318e+07 loss)
I1107 18:16:31.770664 16764 solver.cpp:245]     Train net output #2: loss_bbox = 1.29945e+08 (* 1 = 1.29945e+08 loss)
I1107 18:16:31.770668 16764 solver.cpp:245]     Train net output #3: loss_cls = 29.1122 (* 1 = 29.1122 loss)
I1107 18:16:31.770673 16764 sgd_solver.cpp:107] Iteration 1540, lr = 0.001
/mnt/Programs/CaffeCoding/faster_rcnn/PythonVersion/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:97: RuntimeWarning: overflow encountered in multiply
    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
/mnt/Programs/CaffeCoding/faster_rcnn/PythonVersion/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:98: RuntimeWarning: overflow encountered in multiply
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
I1107 18:16:56.319780 16764 solver.cpp:229] Iteration 1560, loss = nan
I1107 18:16:56.319821 16764 solver.cpp:245]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I1107 18:16:56.319828 16764 solver.cpp:245]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I1107 18:16:56.319831 16764 solver.cpp:245]     Train net output #2: loss_bbox = nan (* 1 = nan loss)
I1107 18:16:56.319836 16764 solver.cpp:245]     Train net output #3: loss_cls = 87.3365 (* 1 = 87.3365 loss)
I1107 18:16:56.319841 16764 sgd_solver.cpp:107] Iteration 1560, lr = 0.001
I1107 18:17:20.768867 16764 solver.cpp:229] Iteration 1580, loss = nan
I1107 18:17:20.768900 16764 solver.cpp:245]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I1107 18:17:20.768908 16764 solver.cpp:245]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I1107 18:17:20.768911 16764 solver.cpp:245]     Train net output #2: loss_bbox = nan (* 1 = nan loss)
I1107 18:17:20.768915 16764 solver.cpp:245]     Train net output #3: loss_cls = 87.3365 (* 1 = 87.3365 loss)
I1107 18:17:20.768919 16764 sgd_solver.cpp:107] Iteration 1580, lr = 0.001
speed: 1.344s / iter
I1107 18:17:45.256358 16764 solver.cpp:229] Iteration 1600, loss = nan
I1107 18:17:45.256387 16764 solver.cpp:245]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I1107 18:17:45.256394 16764 solver.cpp:245]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I1107 18:17:45.256398 16764 solver.cpp:245]     Train net output #2: loss_bbox = nan (* 1 = nan loss)
I1107 18:17:45.256402 16764 solver.cpp:245]     Train net output #3: loss_cls = 87.3365 (* 1 = 87.3365 loss)
I1107 18:17:45.256417 16764 sgd_solver.cpp:107] Iteration 1600, lr = 0.001
I1107 18:18:09.877689 16764 solver.cpp:229] Iteration 1620, loss = nan
I1107 18:18:09.877719 16764 solver.cpp:245]     Train net output #0: FPNClsLoss = 87.3365 (* 1 = 87.3365 loss)
I1107 18:18:09.877727 16764 solver.cpp:245]     Train net output #1: FPNLossBBox = nan (* 1 = nan loss)
I1107 18:18:09.877730 16764 solver.cpp:245]     Train net output #2: loss_bbox = nan (* 1 = nan loss)
I1107 18:18:09.877734 16764 solver.cpp:245]     Train net output #3: loss_cls = 87.3365 (* 1 = 87.3365 loss)
I1107 18:18:09.877740 16764 sgd_solver.cpp:107] Iteration 1620, lr = 0.001
What is going wrong with the training?
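One way to localize this kind of blow-up (a debugging sketch, not code from this repo; it uses only standard pycaffe attributes) is to scan the net's blobs for the first non-finite value right after the iteration where the loss spikes. The two RuntimeWarnings from bbox_transform.py suggest the bbox regression outputs explode first, so lowering the learning rate further, or setting clip_gradients in the Caffe solver, are the usual remedies.

```python
import numpy as np

def first_nonfinite_blob(net):
    """Return the name of the first blob containing inf/nan, else None."""
    for name, blob in net.blobs.items():
        if not np.all(np.isfinite(blob.data)):
            return name
    return None

# Usage inside the Python training loop (solver is a caffe.SGDSolver):
#   solver.step(1)
#   bad = first_nonfinite_blob(solver.net)
#   if bad is not None:
#       print('first non-finite blob:', bad)
```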