Validation loss vs model size per step
Hi @Muennighoff Great paper, very impressive and very detailed work - thanks for releasing the data! I wonder about a small discrepancy that I see between your work and the usual scaling laws. I replotted the data from Figure 15 for 1 epoch, with all 3 models on one plot:
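For reference, this is roughly the sketch I used to make the plot, assuming each run's per-step validation loss sits in a CSV with `step` and `val_loss` columns (the file names are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder paths: one per-step validation-loss CSV per model.
runs = {
    "small": "val_loss_small.csv",
    "medium": "val_loss_medium.csv",
    "8.7B": "val_loss_8b7.csv",
}

for label, path in runs.items():
    df = pd.read_csv(path)  # expects columns: step, val_loss
    plt.plot(df["step"], df["val_loss"], label=label)

plt.xlabel("Training step")
plt.ylabel("Validation loss")
plt.legend(title="Model size")
plt.show()
```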
You can see in the scaling-laws figure that models with more parameters converge faster and reach a better loss. But in your experiments the ~9B-parameter model seems to behave differently. What are your thoughts on this? Thanks!
Thanks for pointing this out - great observation! The 8.7B model trained a bit less smoothly, as you can see in its loss curves. Part of the reason is that we ran it with a different beta2 value, as mentioned in Appendix S. Part of the reason could also be that it simply requires many more nodes, which may introduce more numerical imprecision, e.g. when casting tensors before/after communication. Maybe that is why it is less sample-efficient? 🤔
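To make the beta2 point concrete, this is what changing it looks like for an Adam-style optimizer in PyTorch. The values below are illustrative placeholders, not the ones from the run; the actual betas are listed in Appendix S:

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual LM

# Illustrative betas only; see Appendix S for the values actually used.
# beta2 controls the decay of the second-moment estimate: a lower value
# tracks recent gradient variance more closely, which can change how
# smoothly a large run trains.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95))
```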
Thanks for the response, I tried looking at the Chinchilla paper and there is this figure:
but its x-axis is FLOPs, not steps/tokens processed, so it is hard to tell exactly how the loss behaves as a function of step.
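One rough way to map that figure onto a token axis is the common C ≈ 6·N·D approximation for transformer training compute (N = parameters, D = tokens), which gives D ≈ C / (6N). A quick sketch:

```python
def flops_to_tokens(flops: float, n_params: float) -> float:
    """Rough token count from training FLOPs via the C ≈ 6·N·D rule of thumb."""
    return flops / (6 * n_params)

# e.g. 1e21 FLOPs at 8.7B parameters is roughly 19B tokens processed
print(flops_to_tokens(1e21, 8.7e9))
```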
BTW, in the paper I see that you ran more models:
7M, 146M, 212M, etc. I did not find the data for those runs; it would be interesting to plot them as well, with tokens (steps) on the x-axis and loss on the y-axis.
They should all be here: https://huggingface.co/datablations/lm1-misc
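In case it helps, a sketch for pulling that repo's files locally with `huggingface_hub`; the file layout (and hence `allow_patterns`) is an assumption here, so adjust it to the actual file names, or drop it to fetch everything:

```python
from huggingface_hub import snapshot_download

# Downloads the repo contents to the local cache and returns the path.
# allow_patterns is a guess at the layout; remove it to fetch all files.
local_dir = snapshot_download(
    repo_id="datablations/lm1-misc",
    allow_patterns=["*.csv", "*.json"],
)
print(local_dir)
```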