
[BUG]: Colossalai-OpenMoE-8b : loss value is very large and cannot converge

hangchen426926 opened this issue on Dec 28, 2023 · 4 comments

🐛 Describe the bug

I am currently running the Colossalai/examples/language/openmoe project with the following experimental setup:

- dataset: `load_dataset("yizhongw/self_instruct/data/finetuning/self_instruct_221203", "super_natural_instructions")`
- model: openmoe-8b
- GPUs: 8 × A100
- epochs: 3
- batch size: 4
- learning rate: 0.00001
- zero_stage: 1
- precision: bf16
- Booster plugin: ep_zero
- extra_dp_size: 2
- max_length: 2048
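
For reference, the setup above roughly corresponds to the sketch below. The dataset path and hyperparameter values are copied from the report; the `train_config` dict and its key names are purely illustrative, not the actual openmoe example script or its CLI flags:

```python
from datasets import load_dataset

# Dataset used in the report (path and config name taken from the setup above)
dataset = load_dataset(
    "yizhongw/self_instruct/data/finetuning/self_instruct_221203",
    "super_natural_instructions",
)

# Hyperparameters from the report, gathered in an illustrative dict
train_config = {
    "model": "openmoe-8b",
    "num_gpus": 8,          # A100
    "epochs": 3,
    "batch_size": 4,
    "lr": 1e-5,
    "zero_stage": 1,
    "precision": "bf16",
    "plugin": "ep_zero",    # Booster plugin
    "extra_dp_size": 2,
    "max_length": 2048,
}
```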

### Issue: loss value is very large and cannot converge

I have run into convergence problems during training. The training loss stays at an extremely high level (above 2.3e+10) even after 3 full epochs. The logged loss values are shown below:

(screenshot: logged training loss values)

Furthermore, the default learning rate in the openmoe example is 0.00001. I suspected the issue might stem from an overfitting problem, so I tried lowering the learning rate from 0.00001 to 0.0000001, but I still ran into the same non-convergence.

(screenshots: loss logs with the adjusted learning rate)
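
For clarity, the learning-rate change described above amounts to something like the following. This is a minimal sketch using a standard torch optimizer; the actual openmoe training script may construct its optimizer differently, and `model` here is just a placeholder module:

```python
import torch

# Placeholder standing in for the loaded OpenMoE-8b model (illustrative only)
model = torch.nn.Linear(8, 8)

# Original run used lr = 1e-5; the retried run lowered it to 1e-7.
# In both cases the reported loss stayed above 2.3e+10.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-7)
```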

Environment

- torch 1.13.1
- Python 3.8.17
- CUDA 11.7.17

hangchen426926 · Dec 28, 2023

Thank you for your valuable feedback! 😃 We are working on this bug and will get back to you later.

Orion-Zheng · Dec 30, 2023

I also encountered this problem.

noob-ctrl · Jan 5, 2024

@Orion-Zheng Has this bug been solved?

noob-ctrl · Jan 8, 2024