Training stuck at 8%
We are training a model on 83,000 pictures (2592x1944) of 200 different supermarket products in 200 classes.
To simplify, we created a single class for all products and put that single class in all the XML files.
We are training it on a Tesla V100 with 32 GB using the command `flow --model /root/convert/meu.cfg --train --annotation /root/convert/train --dataset /root/retail/images --gpu 0.9 --batch 50 --lr 0.01 --trainer adam`.
The cfg file has `width=832`, `height=832`, and `filters=30`.
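As a sanity check on that cfg (a hedged aside, not part of the original question): in YOLOv2-style cfg files, the `filters` value of the convolutional layer feeding the `region` layer is expected to be `num_anchors * (num_classes + 5)`, where 5 covers the 4 box coordinates plus the objectness score. With the default 5 anchors and the single merged class, that gives 30, which matches the cfg above.

```python
# Expected `filters` for the conv layer before a YOLOv2 region layer.
# filters = num_anchors * (num_classes + 5)
#   5 = 4 box coordinates + 1 objectness score
def yolo_v2_filters(num_classes: int, num_anchors: int = 5) -> int:
    """Return the expected `filters` value for a YOLOv2 cfg."""
    return num_anchors * (num_classes + 5)

print(yolo_v2_filters(1))    # single merged product class -> 30
print(yolo_v2_filters(200))  # the original 200-class setup -> 1025
```

If you later go back to the full 200-class setup, `filters` would need to change accordingly.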
No matter whether we change the learning rate or the number of pictures per batch, the loss gets stuck at around 8% and will not go down from there.
Any suggestions on how we can get past this snag? Thank you.
@Sequential-circuits I tried to fix this problem by varying the learning rate periodically (as explained in the link below), especially when the loss stays unchanged for a while. Also, sometimes you just need to be patient: more iterations can solve it.
https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/
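For reference, the "varying lr periodically" idea from that article can be sketched as a triangular cyclical schedule. This is a generic illustration, not a darkflow feature; the function name, `base_lr`, `max_lr`, and `step_size` values are all illustrative (with darkflow you would approximate this by restarting training with a different `--lr`).

```python
# A minimal sketch of a triangular cyclical learning-rate policy:
# the lr ramps linearly from base_lr up to max_lr and back down,
# completing one full cycle every 2 * step_size iterations.
def triangular_lr(iteration: int, base_lr: float = 1e-4,
                  max_lr: float = 1e-2, step_size: int = 2000) -> float:
    """Learning rate at a given iteration under a triangular cycle."""
    cycle_pos = iteration % (2 * step_size)
    # Normalized distance from the middle of the cycle, in [0, 1]
    x = abs(cycle_pos / step_size - 1.0)
    return base_lr + (max_lr - base_lr) * (1.0 - x)

print(triangular_lr(0))      # start of a cycle -> base_lr
print(triangular_lr(2000))   # middle of a cycle -> max_lr
```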
Thanks, you are right, and that's actually what we did: we just let it run for a few days and now we are down to 1.45%. We froze the model and it seems to work, and we are letting it run some more. My question now is: what loss do people usually consider the model converged at?
@Sequential-circuits it is supposed to get very close to zero (lower than 0.1). You can run validation every so many iterations to check how your model performs, even before the loss is close to zero. Sometimes it can produce good detections even at relatively high loss values.
Thank you: we let it run some more and the loss would ping-pong between 1 and 2, so since the model seems to work well we consider it trained.
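For anyone landing here with the same "loss ping-pongs in a band" situation: one rough way to decide the model has stopped improving is to compare the mean loss over two consecutive windows. This is a generic sketch; the function name, window size, and tolerance are illustrative, not anything darkflow provides.

```python
# Rough plateau check: treat training as converged when the mean loss of
# the most recent window barely differs from the window before it
# (e.g. the loss just oscillates in a band instead of trending down).
def has_plateaued(losses, window: int = 50, tolerance: float = 0.05) -> bool:
    """True if the last two windows of losses differ by <= tolerance (relative)."""
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return abs(prev - last) <= tolerance * prev

print(has_plateaued([1.5] * 100))              # oscillation-free band -> True
print(has_plateaued([10.0] * 50 + [1.0] * 50)) # still dropping fast -> False
```

Combined with the validation suggestion above, this gives a pragmatic stopping rule: stop when the loss has plateaued *and* detections look good on a held-out set.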