
Weird logits and model degeneration while training with DPO

Open DungNasSa10 opened this issue 1 year ago • 2 comments

Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in CHEN, Zixiang, et al. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024, to build a preference dataset from my own SFT dataset. I use trl for training with this config:

  • Deepspeed zero 3 offload
  • beta = 0.1
  • global_batch_size 128
  • learning_rate 1e-6
  • learning_rate_scheduler cosine
  • optim adam_torch
  • bf16

While training, the loss decreases very quickly, but after the first epoch the logits of both chosen and rejected drop toward 0 and the model suffers from degeneration (it generates the repeated character `) after 1 epoch. Here are the full logs of the training process and a sample output of the model; you can read more in the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" in the attached Google Sheet. (Screenshots of the training logs and a sample output were attached.)
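For reference, here is a minimal sketch of roughly that setup with trl's DPOTrainer. File names, the batch-size split, and the dataset path are hypothetical, and argument names vary across trl versions (newer releases use `processing_class` instead of `tokenizer`), so treat this as an illustration rather than the exact script used:

```python
# Hypothetical sketch of the DPO run described above (trl >= 0.9 style API).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# SPIN-style preference pairs: "chosen" = SFT target, "rejected" = the model's
# own generation for the same prompt (hypothetical file name).
train_dataset = load_dataset("json", data_files="spin_preference_pairs.json")["train"]

args = DPOConfig(
    output_dir="phogpt-4b-chat-spin-dpo",
    beta=0.1,                              # DPO temperature
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,         # e.g. 4 x 8 GPUs x grad_accum 4 = 128 global
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    bf16=True,
    optim="adamw_torch",                   # HF name for torch AdamW
    deepspeed="ds_zero3_offload.json",     # the ZeRO-3 offload config in question
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,                   # `processing_class=` in newer trl
)
trainer.train()
```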

Do you have any suggestions for this problem?

DungNasSa10 avatar Apr 09 '24 07:04 DungNasSa10

Hi, did you solve the problem?

LLMforScience avatar May 08 '24 04:05 LLMforScience

This seems to be a problem with DeepSpeed ZeRO 3. If I use FSDP, everything works fine.

I tried using torch's AdamW instead of DeepSpeed's FusedAdam, but the problem persists.
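One way to make that swap (a sketch, assuming the HF Trainer/trl stack and a DeepSpeed JSON whose name here is a placeholder): remove the "optimizer" block from the DeepSpeed config so DeepSpeed does not build FusedAdam, and let the Trainer create the optimizer via `optim`:

```python
# Hypothetical sketch: forcing torch's AdamW under DeepSpeed ZeRO-3.
# "ds_zero3_offload.json" is a placeholder and must NOT contain an
# "optimizer" entry; the Trainer then builds torch.optim.AdamW itself.
from trl import DPOConfig

args = DPOConfig(
    output_dir="phogpt-4b-chat-spin-dpo",
    optim="adamw_torch",                  # torch.optim.AdamW created by the Trainer
    deepspeed="ds_zero3_offload.json",    # ZeRO-3 offload config without an "optimizer" block
)
```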

ggoggam avatar May 22 '24 05:05 ggoggam