
Weird logits and model degeneration while training with DPO

Open DungNasSa10 opened this issue 1 year ago • 2 comments

Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in CHEN, Zixiang, et al. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024, to build a preference dataset from my own SFT dataset. I use trl for training with this config:

  • Deepspeed zero 3 offload
  • beta = 0.1
  • global_batch_size 128
  • learning_rate 1e-6
  • learning_rate_scheduler cosine
  • optim adam_torch
  • bf16

While training, the loss decreases very quickly, but after the first epoch the logits of both chosen and rejected drop toward 0 and the model suffers from degeneration (it generates the repeated character `) after 1 epoch. Here are the full logs of the training process and a sample output of the model; you can read more in the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" in the attached Google Sheet. (Screenshots of the training logs and a sample output were attached.)
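For reference, here is a minimal sketch of roughly that setup with trl's DPOTrainer. File names, the batch-size split, and the dataset path are hypothetical, and argument names vary across trl versions (newer releases use `processing_class` instead of `tokenizer`), so treat this as an illustration rather than the exact script used:

```python
# Hypothetical sketch of the DPO run described above (trl >= 0.9 style API).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# SPIN-style preference pairs: "chosen" = SFT target, "rejected" = the model's
# own generation for the same prompt (hypothetical file name).
train_dataset = load_dataset("json", data_files="spin_preference_pairs.json")["train"]

args = DPOConfig(
    output_dir="phogpt-4b-chat-spin-dpo",
    beta=0.1,                              # DPO temperature
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,         # e.g. 4 x 8 GPUs x grad_accum 4 = 128 global
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    bf16=True,
    optim="adamw_torch",                   # HF name for torch AdamW
    deepspeed="ds_zero3_offload.json",     # the ZeRO-3 offload config in question
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,                   # `processing_class=` in newer trl
)
trainer.train()
```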

Do you have any suggestions for this problem?

DungNasSa10 avatar Apr 09 '24 07:04 DungNasSa10

Hi, did you solve the problem?

LLMforScience avatar May 08 '24 04:05 LLMforScience

This seems to be a problem with DeepSpeed ZeRO 3. If I use FSDP, everything works fine.

I tried using torch's AdamW instead of DeepSpeed's FusedAdam, but the problem persists.
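One way to make that swap (a sketch, assuming the HF Trainer/trl stack and a DeepSpeed JSON whose name here is a placeholder): remove the "optimizer" block from the DeepSpeed config so DeepSpeed does not build FusedAdam, and let the Trainer create the optimizer via `optim`:

```python
# Hypothetical sketch: forcing torch's AdamW under DeepSpeed ZeRO-3.
# "ds_zero3_offload.json" is a placeholder and must NOT contain an
# "optimizer" entry; the Trainer then builds torch.optim.AdamW itself.
from trl import DPOConfig

args = DPOConfig(
    output_dir="phogpt-4b-chat-spin-dpo",
    optim="adamw_torch",                  # torch.optim.AdamW created by the Trainer
    deepspeed="ds_zero3_offload.json",    # ZeRO-3 offload config without an "optimizer" block
)
```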

ggoggam avatar May 22 '24 05:05 ggoggam