
Training stuck at 8%

Open Sequential-circuits opened this issue 6 years ago • 4 comments

We are training a model with 83,000 pictures (2592x1944) of 200 different supermarket products in 200 classes.

To simplify, we created a single class for all products and put that single class in all the XML files.
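Collapsing all products to one class means rewriting the `<name>` tag of every object in the Pascal VOC XML annotations. A minimal sketch of how that could be done; the function name and the placeholder class name "product" are illustrative, not from this thread:

```python
# Sketch: collapse every object's <name> in a Pascal VOC annotation
# to one single class. "product" is a hypothetical class name.
import xml.etree.ElementTree as ET

def relabel_to_single_class(xml_text, class_name="product"):
    """Return the annotation XML with every object's <name> replaced."""
    root = ET.fromstring(xml_text)
    for obj in root.iter("object"):
        obj.find("name").text = class_name
    return ET.tostring(root, encoding="unicode")
```

The same loop can be run over every `.xml` file in the annotation directory before training.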

We are training it on a Tesla V100 with 32 GB, using the command: flow --model /root/convert/meu.cfg --train --annotation /root/convert/train --dataset /root/retail/images --gpu 0.9 --batch 50 --lr 0.01 --trainer adam

The cfg file has width=832 height=832 filters=30
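As a sanity check on that cfg: in YOLOv2-style configs, the convolutional layer just before the region layer needs filters = num_anchors × (num_classes + 5), since each anchor predicts 4 box coordinates, 1 objectness score, and one score per class. With the single collapsed class and the default 5 anchors this gives 30, which matches filters=30 above. A small sketch of the arithmetic (assuming the YOLOv2 region-layer convention):

```python
def yolo_v2_filters(num_classes, num_anchors=5):
    """Filters for the conv layer preceding the YOLOv2 region layer:
    each anchor predicts 4 box coords + 1 objectness + class scores."""
    return num_anchors * (num_classes + 5)
```

For the original 200-class setup the same formula would have required filters=1025, so collapsing to one class also simplified the cfg.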

No matter how we change the learning rate or the number of pictures per batch, the training gets stuck at around 8% and will not go down further.

Any suggestions on how we can get past this snag? Thank you.

Sequential-circuits avatar Apr 10 '20 09:04 Sequential-circuits

@Sequential-circuits I tried to fix this problem by varying the learning rate periodically (as explained in the link below), especially when the loss stays unchanged for a while. Sometimes it also just needs patience, and more iterations solve it.

https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
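One common way to vary the learning rate periodically is step decay: drop the rate by a fixed factor every N epochs so the optimizer can settle into a plateau and then make finer progress. A minimal sketch (the drop factor and schedule are illustrative, not something darkflow's --lr flag does on its own):

```python
def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))
```

With darkflow one would have to apply this manually: stop training, restart with --load on the latest checkpoint and a smaller --lr.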

mfaramarzi avatar Apr 15 '20 20:04 mfaramarzi

Thanks, you are right, and that's actually what we did: we just let it run for a few days and now we are down to 1.45%. We froze the model and it seems to work, and we are letting it run for some more. My question now would be: what loss do people usually consider low enough to call the model converged?

Sequential-circuits avatar Apr 16 '20 07:04 Sequential-circuits

@Sequential-circuits It is supposed to get very close to zero (lower than 0.1). You can run validation after some iterations to check how your model performs, even before the loss is close to zero. Sometimes it gives good detections even at relatively high loss values.

mfaramarzi avatar Apr 18 '20 20:04 mfaramarzi

Thank you: we let it run for some more and the loss would ping-pong between 1 and 2, so since the model seems to work well we consider it trained.
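One way to turn that "it's ping-ponging, call it done" judgment into a rule is to compare the mean loss of the most recent window against the window before it; if there is no meaningful drop, training has plateaued. A hypothetical sketch (window size and tolerance are assumptions, not from this thread):

```python
def has_plateaued(losses, window=50, tol=0.05):
    """True when the mean loss of the last `window` steps is not lower
    than the mean of the previous window by more than `tol`,
    i.e. the loss is oscillating rather than still decreasing."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return prev - recent < tol
```

Whether a plateau at loss 1-2 is "good enough" is still best confirmed by validation mAP on held-out images, as suggested above.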

Sequential-circuits avatar Apr 24 '20 14:04 Sequential-circuits