
[ALBERT] How to deal with `Model diverged with loss = NaN` when training from scratch?

Open wxp16 opened this issue 6 years ago • 6 comments

Hi,

I tried to train the ALBERT base model from scratch on 64 V100 GPUs. Some of my hyperparameters are: init_lr=2.75e-5 (i.e. 0.00173/64), train_batch_size=48, train_steps=32 million in total, i.e. 500K per GPU.

I used the LAMP optimizer. My training crashed at around step 43K with `Model diverged with loss = NaN`. Does anyone know how to address this? On V100s I can't set a very large train batch size.

[screenshot: training log ending in the NaN-loss crash]

wxp16 avatar Jan 08 '20 16:01 wxp16

Your description is confusing. Does train_batch_size=48 mean the batch size per GPU, or the global batch size? If your global batch size is 48, the learning rate should be scaled by (48/4096)^0.5, not by 1/64, the warm-up steps should stay the same, and the train steps should be scaled by 4096/48. If instead the local batch size is 48 and the global batch size is 48*64=3072, the learning rate should be scaled by (3072/4096)^0.5, the warm-up steps again stay the same, and the train steps should be scaled by 4096/3072. Also, what is the LAMP optimizer? Do you mean LAMB?
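
For reference, a minimal sketch of that square-root scaling rule, taking the base learning rate 0.00173 at batch size 4096 quoted in this thread as the reference point (treat the exact numbers as assumptions for your own setup):

```python
import math

def scale_lr(global_batch_size, base_lr=0.00173, base_batch_size=4096):
    # Square-root LR scaling used with LAMB: lr = base_lr * sqrt(B / B_base)
    return base_lr * math.sqrt(global_batch_size / base_batch_size)

print(scale_lr(48))       # global batch 48   -> ~0.000187
print(scale_lr(48 * 64))  # global batch 3072 -> ~0.0015 (not 0.00173/64 ~= 2.7e-5)
```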

seongwook-ham avatar Jan 09 '20 05:01 seongwook-ham

> Your description is confusing. Does train_batch_size=48 mean the batch size per GPU, or the global batch size? If your global batch size is 48, the learning rate should be scaled by (48/4096)^0.5, not by 1/64, the warm-up steps should stay the same, and the train steps should be scaled by 4096/48. If instead the local batch size is 48 and the global batch size is 48*64=3072, the learning rate should be scaled by (3072/4096)^0.5, the warm-up steps again stay the same, and the train steps should be scaled by 4096/3072. Also, what is the LAMP optimizer? Do you mean LAMB?

Sorry for the incomplete info I gave previously. The train batch size is 48 per GPU, and the optimizer I used is LAMB.

wxp16 avatar Jan 09 '20 15:01 wxp16

> Your description is confusing. Does train_batch_size=48 mean the batch size per GPU, or the global batch size? If your global batch size is 48, the learning rate should be scaled by (48/4096)^0.5, not by 1/64, the warm-up steps should stay the same, and the train steps should be scaled by 4096/48. If instead the local batch size is 48 and the global batch size is 48*64=3072, the learning rate should be scaled by (3072/4096)^0.5, the warm-up steps again stay the same, and the train steps should be scaled by 4096/3072. Also, what is the LAMP optimizer? Do you mean LAMB?

Why should the learning rate be scaled by (3072/4096)^0.5 if train_batch_size=48 is per GPU?

wxp16 avatar Jan 09 '20 16:01 wxp16

See "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes" (https://arxiv.org/abs/1904.00962). Also, you'd better specify the warm-up steps.

seongwook-ham avatar Jan 09 '20 16:01 seongwook-ham

> See "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes" (https://arxiv.org/abs/1904.00962). Also, you'd better specify the warm-up steps.

I did another round of ALBERT base model training. I used Horovod to train on 64 GPUs with train_batch_size=32 per GPU, a global batch size of 2048, and train_steps=250K. Following "Training BERT in 76 Minutes", I used square-root learning-rate scaling, so the initial learning rate is 0.00173/(2^0.5) ≈ 0.00122, where 0.00173 is the learning rate for batch size 4096. For the warm-up steps/proportion I tried 3125/0.0125, 6250/0.025, and 25K/0.1. The optimizer I used is LAMB.
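
(For clarity, a quick sanity check of those numbers, assuming 0.00173 is the reference learning rate at batch size 4096:)

```python
base_lr, base_batch, global_batch, train_steps = 0.00173, 4096, 2048, 250_000

init_lr = base_lr * (global_batch / base_batch) ** 0.5
print(init_lr)                            # ~0.001223, i.e. 0.00173 / 2**0.5

for prop in (0.0125, 0.025, 0.1):
    print(prop, int(prop * train_steps))  # 3125, 6250, 25000 warm-up steps
```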

But unfortunately all of my training runs crashed with `Model diverged with loss = NaN`. Any suggestions? Is it possible to replicate ALBERT model training from scratch?

BTW, I didn't use Google's run_pretrain.py directly, but the code I used is almost the same; we just added some training acceleration techniques, e.g. Horovod, AMP, and XLA.

Thanks.

wxp16 avatar Jan 10 '20 17:01 wxp16

1. In the paper they do not use a constant warm-up proportion; instead they use a constant number of warm-up steps. That means with a global batch size of 2048 you should set the warm-up steps to 3125.
2. In many Horovod codebases the learning rate is automatically scaled by the Horovod size, so your learning rate might be getting multiplied by 64. If so, remove the `learning_rate = learning_rate * horovod_size` part.
3. With Horovod you have to split the tfrecords into many files (at least as many as your Horovod size). Have you done this? (A sketch illustrating points 2 and 3 follows below.)
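
A minimal sketch of points 2 and 3 (the learning-rate value and the tfrecord file pattern are placeholders for illustration, not the code from this repo):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Point 2: the learning rate is already scaled with the sqrt rule, so do NOT
# also multiply it by hvd.size(); drop any `learning_rate *= hvd.size()` line.
learning_rate = 0.00173 * (2048 / 4096) ** 0.5  # global batch size 2048

# Point 3: shard the pretraining tfrecords across workers so each rank reads
# a distinct subset; this requires at least hvd.size() input files.
files = tf.data.Dataset.list_files("pretrain_tfrecords/*.tfrecord", shuffle=False)
files = files.shard(num_shards=hvd.size(), index=hvd.rank())
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
```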

seongwook-ham avatar Jan 10 '20 23:01 seongwook-ham