
Validation loss vs model size per step

orena1 opened this issue 1 year ago · 3 comments

Hi @Muennighoff. Great paper, very impressive and very detailed work; thanks for releasing the data! I wonder about a small discrepancy I see between your results and the usual scaling laws. I replotted the data from Figure 15 for 1 epoch, with all 3 models on one plot:

[Image: validation loss vs. steps, all three models, 1 epoch]

You can see in the scaling-laws figure that models with more parameters converge faster and reach better loss, but in your experiments the 9B-parameter model seems to behave differently. What are your thoughts on this? Thanks!
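For reference, the expectation being compared against can be sketched with a Chinchilla-style parametric loss, L(N, D) = E + A/N^α + B/D^β, under which a larger model should sit at a lower loss for any fixed token count. The constants below are the fitted values reported by Hoffmann et al. (2022); the token count and model sizes are just illustrative:

```python
# Chinchilla-style parametric loss: L(N, D) = E + A/N^alpha + B/D^beta.
# Fitted constants from Hoffmann et al. (2022); model sizes match Figure 15.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted validation loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# At any fixed token count, the predicted loss decreases with model size:
for n in (2.8e9, 4.2e9, 8.7e9):
    print(f"{n/1e9:.1f}B params @ 100B tokens -> predicted loss {loss(n, 100e9):.3f}")
```

Under this fit the 8.7B curve should lie strictly below the 2.8B curve at every step, which is what makes the replotted curves surprising.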

orena1 avatar Feb 20 '25 03:02 orena1

Thanks for pointing this out, great observation! The 8.7B model trained a bit less smoothly, as you can see in its loss curves. Part of the reason is that we ran it with a different beta2 value, as mentioned in Appendix S. It could also be that it simply requires many more nodes, which may introduce more imprecision, e.g. when casting values before/after communication. Maybe that is why it is less sample-efficient? 🤔

Muennighoff avatar Feb 20 '25 06:02 Muennighoff

Thanks for the response. I've tried to look at the Chinchilla paper, and there is this figure:

[Image: Chinchilla training curves]

but the x-axis is FLOPs, not steps/tokens processed, so it is hard to know exactly how the loss behaves as a function of step.

BTW, in the paper I see that you ran more models:

[Image: table of additional model sizes]

7M, 146M, 212M, etc. I did not find the data for those runs; it would be interesting to plot them as well, with tokens (steps) on the x-axis and loss on the y-axis.

orena1 avatar Feb 20 '25 06:02 orena1

They should all be here: https://huggingface.co/datablations/lm1-misc

Muennighoff avatar Feb 20 '25 07:02 Muennighoff