
Large loss jump at the beginning of the second epoch in training

Open ghtaro opened this issue 2 years ago • 3 comments

Hi,

I have run SFT using my own checkpoint of the GPTJ-6B model (by setting the "input model" widget box to the location where the model and tokenizer are stored).

I trained the above for 2 epochs and found that the train loss jumps down at the beginning of the second epoch. At the same time, the eval loss jumps up (you can see the picture below).

I did the same training a couple of days ago with the pretrained GPTJ-6B model from Hugging Face and did not see the jumps.

Do you have any idea why this happens?

[Image: train and eval loss curves, with the train loss dropping and the eval loss rising at the start of the second epoch]

ghtaro avatar Apr 25 '23 12:04 ghtaro

@matthayes I think y'all observed this during training too, and it was mysterious. I think the answer was to turn down the learning rate a bit? But the final value in the training script was already lowered to avoid this.
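For concreteness, a minimal sketch of what turning the learning rate down could look like with Hugging Face `TrainingArguments`; the values below are illustrative assumptions, not the actual settings from the dolly training script:

```python
from transformers import TrainingArguments

# Illustrative values only; the real dolly trainer may use different settings.
training_args = TrainingArguments(
    output_dir="./sft-gptj-6b",          # hypothetical output path
    num_train_epochs=2,
    learning_rate=5e-6,                  # lower this if loss jumps/drops appear
    warmup_steps=50,
    lr_scheduler_type="linear",          # a decaying schedule also smooths late-epoch updates
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=10,
)
```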

srowen avatar Apr 25 '23 14:04 srowen

I don't have a good explanation for why it jumps on your own checkpoint but not the pretrained GPTJ-6B. I have noticed this type of behavior before when training, and lowering the learning rate has helped reduce the size of the jumps/drops in loss. It also helps to evaluate the model qualitatively at several checkpoints by testing it on a fixed set of instructions. I've observed that the eval loss doesn't necessarily correlate well with qualitatively good results; this could have to do with the amount and nature of the training data and the size of the test set.
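As a rough illustration of this checkpoint spot-checking idea, here is a hedged sketch using the `transformers` API. The checkpoint paths and prompts are made up, and a real test would wrap the prompts in dolly's instruction format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint paths and prompts; adapt to your own run.
CHECKPOINTS = ["./sft-gptj-6b/checkpoint-500", "./sft-gptj-6b/checkpoint-1000"]
PROMPTS = [
    "Explain what supervised fine-tuning is in one paragraph.",
    "Write a short poem about machine learning.",
]

for ckpt in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        print(f"--- {ckpt} ---")
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```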

matthayes avatar Apr 25 '23 22:04 matthayes

@srowen @matthayes thanks. Let me rerun the training with a lower LR (would 5e-7 be fine?) and check the quality of inference on the test dataset.

I am concerned that the jump happens at the very beginning of the second epoch; it does not look like a coincidence.

  • It looks like the model overfits the training data in the first epoch, so when it sees the same examples a second time (in the second epoch) the train loss is very small. Is the training data shuffled differently in the second epoch than in the first? (See the sketch after this list.)
  • The eval loss does not agree with my thought above: if the model were overfitting in the first epoch, the eval loss should not have decreased during it.
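On the shuffling question, a toy sketch in plain PyTorch (not the dolly code) showing that a `DataLoader` with `shuffle=True` draws a fresh permutation on every pass, which is also how the Hugging Face `Trainer` reshuffles between epochs by default:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 8 example IDs, just to inspect the ordering.
dataset = TensorDataset(torch.arange(8))

# With shuffle=True, a fresh permutation is drawn each time the DataLoader
# is iterated, so epoch 2 generally sees a different order than epoch 1.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for epoch in range(2):
    order = [batch[0].tolist() for batch in loader]
    print(f"epoch {epoch + 1}: {order}")
```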

ghtaro avatar Apr 26 '23 00:04 ghtaro