
Non-decreasing loss for custom dataset

Open Bedrettin-Cetinkaya opened this issue 4 years ago • 10 comments

Hello,

Thanks for this great work. I am trying to train it on my own dataset. Because of the image size (3x224x224), I used a batch size of 256, and here is my training loss:

Train: [1][1/46] BT 23.350 (23.350) DT 9.071 (9.071) loss 11.166 (11.166)
Train: [1][2/46] BT 16.007 (19.679) DT 15.592 (12.332) loss 8.922 (10.044)
Train: [1][3/46] BT 8.364 (15.907) DT 7.961 (10.875) loss 8.911 (9.666)
Train: [1][4/46] BT 8.639 (14.090) DT 8.236 (10.215) loss 8.910 (9.477)
Train: [1][5/46] BT 12.832 (13.838) DT 12.427 (10.657) loss 8.909 (9.364)
Train: [1][6/46] BT 12.243 (13.573) DT 11.799 (10.848) loss 8.909 (9.288)
Train: [1][7/46] BT 8.403 (12.834) DT 7.996 (10.440) loss 8.909 (9.234)
Train: [1][8/46] BT 8.497 (12.292) DT 8.061 (10.143) loss 8.909 (9.193)
Train: [1][9/46] BT 15.428 (12.640) DT 15.016 (10.685) loss 8.909 (9.162)
Train: [1][10/46] BT 8.534 (12.230) DT 8.121 (10.428) loss 8.909 (9.136)
Train: [1][11/46] BT 8.489 (11.890) DT 8.087 (10.215) loss 8.909 (9.116)
Train: [1][12/46] BT 8.604 (11.616) DT 8.170 (10.045) loss 8.909 (9.098)
Train: [1][13/46] BT 8.440 (11.372) DT 8.033 (9.890) loss 8.909 (9.084)
Train: [1][14/46] BT 10.341 (11.298) DT 9.934 (9.893) loss 8.909 (9.071)
Train: [1][15/46] BT 9.609 (11.185) DT 9.208 (9.848) loss 8.909 (9.061)
Train: [1][16/46] BT 10.157 (11.121) DT 9.739 (9.841) loss 8.909 (9.051)
Train: [1][17/46] BT 8.580 (10.972) DT 8.173 (9.743) loss 8.909 (9.043)
Train: [1][18/46] BT 9.639 (10.897) DT 9.236 (9.715) loss 8.909 (9.035)
Train: [1][19/46] BT 8.364 (10.764) DT 7.926 (9.620) loss 8.909 (9.029)
Train: [1][20/46] BT 11.902 (10.821) DT 11.496 (9.714) loss 8.909 (9.023)
Train: [1][21/46] BT 9.753 (10.770) DT 9.351 (9.697) loss 8.909 (9.017)
Train: [1][22/46] BT 8.304 (10.658) DT 7.893 (9.615) loss 8.909 (9.012)
Train: [1][23/46] BT 9.507 (10.608) DT 9.089 (9.592) loss 8.909 (9.008)
Train: [1][24/46] BT 9.824 (10.575) DT 9.414 (9.585) loss 8.909 (9.004)

The loss stays stuck at 8.909. What is the reason for this, and how can I fix it?
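For context on this symptom: a loss that freezes at a constant value after the first few iterations is a classic signature of a collapsed encoder. If every image maps to the same L2-normalized embedding, the contrastive softmax becomes uniform over the other samples and the SupCon loss settles at exactly log(M - 1), where M is the number of views in the batch. A minimal NumPy sketch (a hypothetical re-derivation for illustration, not the repo's implementation) shows this:

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.07):
    # Minimal NumPy sketch of the supervised contrastive loss
    # for L2-normalized features of shape (M, D).
    M = features.shape[0]
    sim = features @ features.T / temperature
    logits_mask = 1.0 - np.eye(M)          # exclude self-similarity
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp_sim = np.exp(sim) * logits_mask
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    pos_mask = (labels[:, None] == labels[None, :]).astype(float) * logits_mask
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1)
    return -mean_log_prob_pos.mean()

# Collapsed encoder: every sample maps to the same unit vector.
M, D = 8, 16
v = np.ones(D) / np.sqrt(D)
features = np.tile(v, (M, 1))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])

loss = supcon_loss(features, labels)
# With identical embeddings the softmax is uniform over the M - 1 others,
# so the loss is log(M - 1), independent of labels and temperature.
assert np.isclose(loss, np.log(M - 1))
```

If your plateau value matches log(M - 1) for your batch configuration, the representations have collapsed and the suggestions below (batch size, normalization, learning rate, warm-up) are the usual levers.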

Thanks in advance.

Bedrettin-Cetinkaya avatar Jul 03 '21 16:07 Bedrettin-Cetinkaya

I've seen similar behavior in the pre-training stage. Just curious: are the above results from pre-training? One thing that tended to help avoid pre-training plateaus was increasing the batch size. If your dataset is similar to ImageNet, try a batch size of at least 1024. No guarantees, of course...
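For reference, a larger-batch pre-training run can be launched along these lines (flag names follow the SupContrast README; values and paths are illustrative and should be adjusted for your setup):

```shell
# Hypothetical invocation for SupCon pre-training with a larger batch;
# --cosine enables cosine learning-rate annealing as suggested in the README.
python main_supcon.py --batch_size 1024 --learning_rate 0.5 --temp 0.1 --cosine
```

Note that a batch of 1024 images at 3x224x224 needs substantially more GPU memory than batch 256, so this may require multiple GPUs or gradient-friendly tricks.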

ibarrien avatar Jul 05 '21 06:07 ibarrien

Hi, I also encountered this problem. Have you solved it?

zws-2019 avatar Jul 06 '21 11:07 zws-2019

I had the same problem, but I found that the pre-training plateau continues for several epochs and then the loss starts to fall. How can this be solved?

tianfr avatar Jul 09 '21 10:07 tianfr

I couldn't solve it, still have this problem.

Bedrettin-Cetinkaya avatar Jul 10 '21 12:07 Bedrettin-Cetinkaya

Hey @Bedrettin-Cetinkaya , have you solved it? I have a similar problem on a different dataset

eyalho avatar Aug 13 '21 18:08 eyalho

Does a batch size of 1024 solve it?

leiyu1980 avatar Sep 02 '21 14:09 leiyu1980

I had the exact same issue: after one batch the loss saturates and remains constant. Interestingly, it remained constant even with a large learning rate. Here are a few things you can do to fix it:

  • Reduce the batch size. A smaller batch allows for more variance within the batch and helps the model get out of local minima.
  • Make sure your custom dataset is normalized. Provide a mean and std when you run the script; otherwise you may get erratic gradients that throw off the network and stall it on a plateau.
  • Reduce the learning rate. The default learning rate is tuned for large batch sizes, so lowering it might help.
  • Try another backbone. I believe something might be wrong in the current implementation of the ResNet model used here; using a model from the timm library, I got very different behaviour. (This last step really helped me.)
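On the normalization point: if you don't know your dataset's statistics, you can compute the per-channel mean and std over the training images and pass them to the script. A quick NumPy sketch, assuming images are already loaded as float arrays in [0, 1] (the random array below is a hypothetical stand-in for your training set):

```python
import numpy as np

# Hypothetical stand-in for a training set: N images, HWC layout, float in [0, 1].
images = np.random.rand(100, 224, 224, 3)

# Per-channel mean and std over all pixels of all images.
mean = images.mean(axis=(0, 1, 2))
std = images.std(axis=(0, 1, 2))

print("mean:", tuple(mean.round(3)))
print("std:", tuple(std.round(3)))
```

For a dataset too large to hold in memory, accumulate per-channel sums and squared sums over batches instead, then derive mean and std at the end.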

hugovergnes avatar Feb 16 '22 01:02 hugovergnes

@hugovergnes Thanks for your suggestion; reducing the batch size worked for me.

e96031413 avatar Jun 01 '22 02:06 e96031413

I think I've figured it out. The --cosine option shown in the README is meant to be used together with --warm, but in the config function parse_option, warm-up is only enabled automatically when batch_size > 256; if batch_size is 256 or less, warm-up stays off by default.

In short: append --warm and the loss will decrease as designed.
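To make the interaction concrete, here is a small sketch of the behavior described above (function and parameter names are illustrative, not the repo's exact code): warm-up ramps the learning rate linearly from a small value over the first few epochs, and is only switched on automatically for batches larger than 256.

```python
def warmup_enabled(batch_size, warm_flag=False):
    # Warm-up is forced on for large batches; smaller batches
    # need an explicit --warm on the command line.
    return warm_flag or batch_size > 256

def learning_rate(epoch, base_lr=0.5, warm=True, warm_epochs=10, warmup_from=0.01):
    # Linear warm-up from warmup_from to base_lr over the first warm_epochs epochs,
    # then the base learning rate (cosine annealing would apply on top of this).
    if warm and epoch < warm_epochs:
        return warmup_from + (base_lr - warmup_from) * epoch / warm_epochs
    return base_lr
```

So with batch_size=256 and no --warm, training starts at the full base learning rate immediately, which can push the contrastive objective straight onto the plateau people report above.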

BeNhNp avatar Sep 02 '22 09:09 BeNhNp

> I couldn't solve it, still have this problem.

Have you solved this? I use SupConLoss on my dataset with batch size 128 and the loss doesn't decrease.

Thewillman avatar Dec 08 '22 11:12 Thewillman