
Training only on 2B tokens (openwebtext)

Open Nandan91 opened this issue 1 year ago • 3 comments

Hi! Interesting work on the role of explicit bias!

I was wondering what training settings got you an eval PPL of ~3.04. The paper mentions that 50K iterations are required to train the GPT-2 model on 2B tokens. What batch_size_per_device and block_size did you use for that run? Did you train from scratch or fine-tune the pre-trained model (trained on 300B tokens)?
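For context, here is my rough arithmetic for how 50K iterations could cover 2B tokens (a sketch; block_size below is just the nanoGPT default, assumed rather than taken from the paper):

```python
# Back-of-the-envelope: what per-iteration batch would give 2B tokens in 50K iterations?
total_tokens = 2_000_000_000
iterations = 50_000
block_size = 1024  # assumption: the nanoGPT default context length

tokens_per_iter = total_tokens / iterations        # 40,000 tokens per iteration
sequences_per_iter = tokens_per_iter / block_size  # ~39 sequences of 1024 tokens
print(tokens_per_iter, sequences_per_iter)
```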

Thanks!

Nandan91 avatar Mar 22 '24 18:03 Nandan91

Hi, thanks for your interest in our work.

The training config is shown here; I think the batch size there is automatically divided by the number of GPUs available (here).
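Roughly, the relevant nanoGPT logic looks like this (a paraphrased sketch, not a verbatim copy of train.py; the gradient accumulation value below is assumed):

```python
# Paraphrased sketch of how nanoGPT splits the per-step work across GPUs under DDP
# (not a verbatim copy of train.py).
import os

ddp_world_size = int(os.environ.get("WORLD_SIZE", 1))  # number of GPUs in the DDP run
gradient_accumulation_steps = 40                        # value from the config (assumed here)

# Each rank performs only its share of the accumulation steps, so the global batch
# per optimizer step stays the same regardless of how many GPUs are used.
assert gradient_accumulation_steps % ddp_world_size == 0
gradient_accumulation_steps //= ddp_world_size
```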

We do not perform any fine-tuning but instead train all the GPT-2 models from scratch.

Eric-mingjie avatar Mar 22 '24 19:03 Eric-mingjie

Thanks for your reply.

The training configuration you referred to seems to be set up for 600K training steps. As mentioned in the paper, you ran 50K iterations to train on only 2B tokens (and got an eval PPL of ~3). Did you change anything else, such as the learning rate, weight decay, etc.?
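To be concrete, the change I have in mind is just shortening the schedule in the nanoGPT GPT-2 config, something like the sketch below (illustrative only; the learning rate and weight decay are the assumed nanoGPT defaults):

```python
# Illustrative nanoGPT-style config override for a shorter 50K-iteration run
# (a sketch; learning_rate and weight_decay are the assumed nanoGPT GPT-2 defaults).
max_iters = 50000        # 50K iterations instead of 600K
lr_decay_iters = 50000   # keep the cosine decay aligned with max_iters
learning_rate = 6e-4     # assumed unchanged
weight_decay = 1e-1      # assumed unchanged
```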

I trained for 50K iterations; however, my val loss remained ~3 (PPL >30).
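(For reference, I'm converting between val loss and perplexity as PPL = exp(loss); a few nearby values:)

```python
import math

# perplexity = exp(token-level cross-entropy loss in nats)
for loss in (3.0, 3.04, 3.4):
    print(loss, round(math.exp(loss), 1))  # 20.1, 20.9, 30.0
```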

Nandan91 avatar May 08 '24 19:05 Nandan91

No, I did not change anything such as the learning rate or weight decay. I recall that my numbers are around those reported in the original nanoGPT repo (https://github.com/karpathy/nanoGPT?tab=readme-ov-file#baselines).

Eric-mingjie avatar May 09 '24 02:05 Eric-mingjie