Non-decreasing loss for custom dataset
Hello,
Thanks for this great work. I am trying to train this model on my own dataset. Because of the image size (3x224x224), I used a batch size of 256. Here is my training loss:
Train: [1][1/46] BT 23.350 (23.350) DT 9.071 (9.071) loss 11.166 (11.166)
Train: [1][2/46] BT 16.007 (19.679) DT 15.592 (12.332) loss 8.922 (10.044)
Train: [1][3/46] BT 8.364 (15.907) DT 7.961 (10.875) loss 8.911 (9.666)
Train: [1][4/46] BT 8.639 (14.090) DT 8.236 (10.215) loss 8.910 (9.477)
Train: [1][5/46] BT 12.832 (13.838) DT 12.427 (10.657) loss 8.909 (9.364)
Train: [1][6/46] BT 12.243 (13.573) DT 11.799 (10.848) loss 8.909 (9.288)
Train: [1][7/46] BT 8.403 (12.834) DT 7.996 (10.440) loss 8.909 (9.234)
Train: [1][8/46] BT 8.497 (12.292) DT 8.061 (10.143) loss 8.909 (9.193)
Train: [1][9/46] BT 15.428 (12.640) DT 15.016 (10.685) loss 8.909 (9.162)
Train: [1][10/46] BT 8.534 (12.230) DT 8.121 (10.428) loss 8.909 (9.136)
Train: [1][11/46] BT 8.489 (11.890) DT 8.087 (10.215) loss 8.909 (9.116)
Train: [1][12/46] BT 8.604 (11.616) DT 8.170 (10.045) loss 8.909 (9.098)
Train: [1][13/46] BT 8.440 (11.372) DT 8.033 (9.890) loss 8.909 (9.084)
Train: [1][14/46] BT 10.341 (11.298) DT 9.934 (9.893) loss 8.909 (9.071)
Train: [1][15/46] BT 9.609 (11.185) DT 9.208 (9.848) loss 8.909 (9.061)
Train: [1][16/46] BT 10.157 (11.121) DT 9.739 (9.841) loss 8.909 (9.051)
Train: [1][17/46] BT 8.580 (10.972) DT 8.173 (9.743) loss 8.909 (9.043)
Train: [1][18/46] BT 9.639 (10.897) DT 9.236 (9.715) loss 8.909 (9.035)
Train: [1][19/46] BT 8.364 (10.764) DT 7.926 (9.620) loss 8.909 (9.029)
Train: [1][20/46] BT 11.902 (10.821) DT 11.496 (9.714) loss 8.909 (9.023)
Train: [1][21/46] BT 9.753 (10.770) DT 9.351 (9.697) loss 8.909 (9.017)
Train: [1][22/46] BT 8.304 (10.658) DT 7.893 (9.615) loss 8.909 (9.012)
Train: [1][23/46] BT 9.507 (10.608) DT 9.089 (9.592) loss 8.909 (9.008)
Train: [1][24/46] BT 9.824 (10.575) DT 9.414 (9.585) loss 8.909 (9.004)
The loss gets stuck at 8.909 after the first few iterations. What is the reason for this, and how can I fix it?
Thanks in advance.
I've seen similar behavior in the pre-training stage. Just curious: are the above results from pre-training? One thing that tended to help avoid pre-training plateaus was increasing the batch size. If your dataset is similar to ImageNet, try a batch size of at least 1024. No guarantees, of course...
Hi, I also encountered this problem. Have you solved it?
I had the same problem, but I found that the pre-training plateau continues for several epochs and then the loss drops. How can this be solved?
I couldn't solve it, still have this problem.
Hey @Bedrettin-Cetinkaya , have you solved it? I have a similar problem on a different dataset
Can it be fixed by using a batch size of 1024?
I had the exact same issue: after one batch the loss saturates and remains constant. Interestingly, it stayed constant even with a large learning rate. Here are a few things you can do to fix it:
- Reduce the batch size. It allows for more variance within each batch and helps the optimizer get out of local minima.
- Make sure your custom dataset is normalized. You should provide a mean and std when you run the script; otherwise you may get erratic gradients that throw off the network and stall it on a plateau (see the sketch after this list).
- Reduce the learning rate. The default learning rate is tuned for a large batch size, so lowering it might help.
- Try another backbone. I believe something might be wrong in the current implementation of the ResNet model used here; using a model from the timm library I got a very different behaviour. (This last step really helped me.)
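
A minimal sketch of the normalization point above, assuming a torchvision ImageFolder-style layout. The dataset path, image size, and loader settings are placeholders, and how you pass the resulting mean/std to the training script (e.g. via `--mean`/`--std` flags, if your copy of the repo supports them) should be checked against your version:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 'path/to/custom_dataset' is a placeholder for your own ImageFolder layout.
dataset = datasets.ImageFolder(
    'path/to/custom_dataset',
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)
loader = DataLoader(dataset, batch_size=64, num_workers=4, shuffle=False)

# Accumulate per-channel statistics over the whole dataset.
n_pixels = 0
channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
for images, _ in loader:
    b, c, h, w = images.shape
    n_pixels += b * h * w
    channel_sum += images.sum(dim=[0, 2, 3])
    channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])

mean = channel_sum / n_pixels
std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
print('mean:', mean.tolist())
print('std:', std.tolist())
```

The printed values can then be plugged into the training command (or hard-coded in the dataset transform) so that the inputs are roughly zero-mean and unit-variance.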
@hugovergnes Thanks for your suggestions; reducing the batch size worked for me.
I think I've figured it out. The --cosine flag shown in the README is meant to be used together with --warm, but in the parse_option config function, warm is only set to True automatically when batch_size > 256; with a batch size of 256 or less, warm-up is off by default.
In short, append --warm to your command and the loss will decrease as designed.
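
For reference, the relevant warm-up logic looks roughly like the sketch below (paraphrased from parse_option in main_supcon.py; exact field names and default values may differ in your version of the repo):

```python
import math

def apply_warmup_defaults(opt):
    # Warm-up is forced on only for large batches; for batch_size <= 256
    # you must pass --warm explicitly on the command line.
    if opt.batch_size > 256:
        opt.warm = True
    if opt.warm:
        opt.warmup_from = 0.01
        opt.warm_epochs = 10
        if opt.cosine:
            # With cosine decay, warm up towards the schedule's value
            # at the end of the warm-up phase.
            eta_min = opt.learning_rate * (opt.lr_decay_rate ** 3)
            opt.warmup_to = eta_min + (opt.learning_rate - eta_min) * (
                1 + math.cos(math.pi * opt.warm_epochs / opt.epochs)) / 2
        else:
            opt.warmup_to = opt.learning_rate
    return opt
```

So with a batch size of 256 or below, a launch along the lines of `python main_supcon.py --batch_size 256 --cosine --warm ...` is needed to get the warm-up that the README's large-batch example enables implicitly.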
Have you solved this issue? I'm using SupConLoss on my dataset with a batch size of 128 and the loss doesn't decrease.