
Plot scaling laws of our baseline models

slippylolo opened this issue 4 years ago · 2 comments

For our three baselines on different datasets (OSCAR, C4, The Pile), we would like to plot scaling laws and retrieve their coefficients. Specifically, we are looking to reproduce Figure 1 of Scaling Laws for Neural Language Models.
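As a starting point, the power-law fit behind Figure 1 can be done with a plain log-log linear regression. The sketch below uses synthetic data (the parameter counts and the coefficients `alpha = 0.076`, `N_c = 8.8e13` are taken from the Kaplan et al. paper for illustration, not from our tr3 runs); the actual inputs would be the validation losses pulled from the TensorBoard logs.

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ≈ (n_c / n)**alpha by linear regression in log-log space.

    log(loss) = -alpha * log(n) + alpha * log(n_c), so the slope gives
    alpha and the intercept gives the critical scale n_c.
    """
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    alpha = -slope                    # loss decreases with scale
    n_c = np.exp(intercept / alpha)   # recover the critical scale
    return alpha, n_c

# Illustrative parameter counts for S/M/L/XL-sized models (placeholders,
# not the actual tr3 model sizes).
params = np.array([125e6, 350e6, 760e6, 1.3e9])
# Synthetic losses generated from the Kaplan et al. Figure 1 coefficients.
losses = (8.8e13 / params) ** 0.076

alpha, n_c = fit_power_law(params, losses)
```

On noiseless synthetic data the fit recovers the generating coefficients exactly; on the real runs the residuals around the fitted line are what tell us whether the scaling law holds.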

The TensorBoard data for the baseline runs can be retrieved on the Big Science space on HuggingFace: it's the tr3 runs with tensorboard in their name. The naming scheme (tr3b, tr3c, etc.) is explained here. For C4, we have an XL, L, and M model (tr3, tr3c, tr3c) with short warm-up. For OSCAR and The Pile, we have an XL, L, M, and S model (tr3d, tr3g, tr3h, tr3i and tr3, tr3j, tr3k, tr3l). For OSCAR, we should also add the 13B run to see if the fits hold (that's tr1-13B).

slippylolo avatar Oct 05 '21 07:10 slippylolo

Just to make sure: is the loss taken from "lm-loss-validation/lm loss validation"? And from the last step, or from the global minimum over training?

srulikbd avatar Oct 06 '21 20:10 srulikbd
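The two candidate readings raised above can be sketched on a toy validation-loss series (the numbers here are synthetic, not from any tr3 run):

```python
import numpy as np

# Synthetic validation-loss curve with a slight uptick at the end,
# so the last-step value and the global minimum differ.
steps = np.array([1000, 2000, 3000, 4000, 5000])
val_loss = np.array([4.2, 3.6, 3.1, 3.0, 3.05])

final_loss = val_loss[-1]   # reading 1: loss at the last logged step
min_loss = val_loss.min()   # reading 2: global minimum over training
```

If runs are still improving at the end of training the two readings coincide; they only diverge when validation loss has started to rise, so the choice mainly matters for runs that overfit or were trained past their optimum.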

I've temporarily assigned @slippylolo, feel free to re-assign.

thomasw21 avatar Oct 21 '21 23:10 thomasw21