
Large loss jump at the beginning of the second epoch in training

Open ghtaro opened this issue 2 years ago • 3 comments

Hi,

I have run SFT using my own checkpoint of the GPTJ-6B model (by setting the "input model" widget box to the location where the model and tokenizer are stored).

I trained the above for 2 epochs and found that the train loss jumps down at the beginning of the second epoch. At the same time, the eval loss jumps up (you can see the picture below).

I did the same training a couple of days ago with the pretrained GPTJ-6B model from Hugging Face and did not see the jumps.

Do you have any idea why this happens?

[Image: train and eval loss curves, with the train loss dropping and the eval loss rising at the start of the second epoch]

ghtaro avatar Apr 25 '23 12:04 ghtaro

@matthayes I think y'all observed this during training too, and it was mysterious. I think the answer was to turn down the learning rate a bit? But the final value in the training script was already lowered to avoid this.
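For concreteness, a minimal sketch of what turning the learning rate down could look like with Hugging Face `TrainingArguments`; the values below are illustrative assumptions, not the actual settings from the dolly training script:

```python
from transformers import TrainingArguments

# Illustrative values only; the real dolly trainer may use different settings.
training_args = TrainingArguments(
    output_dir="./sft-gptj-6b",          # hypothetical output path
    num_train_epochs=2,
    learning_rate=5e-6,                  # lower this if loss jumps/drops appear
    warmup_steps=50,
    lr_scheduler_type="linear",          # a decaying schedule also smooths late-epoch updates
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=10,
)
```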

srowen avatar Apr 25 '23 14:04 srowen

I don't have a good explanation for why it jumps on your own checkpoint but not the pretrained GPTJ-6B. I have noticed this type of behavior before when training, and lowering the learning rate has helped reduce the size of the jumps/drops in loss. It also helps to evaluate the model qualitatively at several checkpoints by testing it on a fixed set of instructions. I've observed that the eval loss doesn't necessarily correlate well with qualitatively good results; this could have to do with the amount and nature of the training data and the size of the test set.
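As a rough illustration of this checkpoint spot-checking idea, here is a hedged sketch using the `transformers` API. The checkpoint paths and prompts are made up, and a real test would wrap the prompts in dolly's instruction format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint paths and prompts; adapt to your own run.
CHECKPOINTS = ["./sft-gptj-6b/checkpoint-500", "./sft-gptj-6b/checkpoint-1000"]
PROMPTS = [
    "Explain what supervised fine-tuning is in one paragraph.",
    "Write a short poem about machine learning.",
]

for ckpt in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        print(f"--- {ckpt} ---")
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```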

matthayes avatar Apr 25 '23 22:04 matthayes

@srowen @matthayes thanks. Let me rerun the training with a lower LR (would 5e-7 be fine?) and check the quality of inference on the test dataset.

I am concerned that the jump happens at the very beginning of the second epoch; it does not look like a coincidence.

  • It looks like the model overfits the training data in the first epoch, so when it sees the same examples a second time (in the second epoch) the train loss is very small. Is the training data shuffled differently in the second epoch than in the first? (See the sketch after this list.)
  • The eval loss does not agree with my thought above: if the model were overfitting in the first epoch, the eval loss should not have decreased during it.
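On the shuffling question, a toy sketch in plain PyTorch (not the dolly code) showing that a `DataLoader` with `shuffle=True` draws a fresh permutation on every pass, which is also how the Hugging Face `Trainer` reshuffles between epochs by default:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 8 example IDs, just to inspect the ordering.
dataset = TensorDataset(torch.arange(8))

# With shuffle=True, a fresh permutation is drawn each time the DataLoader
# is iterated, so epoch 2 generally sees a different order than epoch 1.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    order = [batch[0].tolist() for batch in loader]
    print(f"epoch {epoch + 1}: {order}")
```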

ghtaro avatar Apr 26 '23 00:04 ghtaro