direct-preference-optimization
Reference implementation for DPO (Direct Preference Optimization)
Thanks for putting this together. I am wondering how evals are done on trained models. Are there third-party evaluation libraries that you use to measure trained model performance/metrics, or...
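For context while this question is open: the DPO paper reports pairwise win rates judged against a baseline. A minimal sketch of such an eval is below; the judge is left as a hypothetical callable (e.g. a GPT-4 prompt), and none of this is the repo's actual evaluation code.

```python
# Minimal sketch of a pairwise win-rate evaluation.
# `judge_prefers_a` is a hypothetical callable (e.g. backed by a GPT-4 judge prompt)
# that returns True when response A is preferred over response B for the given prompt.
from typing import Callable, Sequence

def win_rate(
    prompts: Sequence[str],
    policy_responses: Sequence[str],
    baseline_responses: Sequence[str],
    judge_prefers_a: Callable[[str, str, str], bool],
) -> float:
    """Fraction of prompts on which the judge prefers the policy response over the baseline."""
    wins = 0
    for prompt, policy_out, baseline_out in zip(prompts, policy_responses, baseline_responses):
        if judge_prefers_a(prompt, policy_out, baseline_out):
            wins += 1
    return wins / len(prompts)
```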
Recently, I have experimented with DPO training for Vietnamese. I started with a strong SFT model, [vinai/PhoGPT-4B-Chat](https://huggingface.co/vinai/PhoGPT-4B-Chat), and followed the method described in [Chen, Zixiang, et al., Self-Play Fine-Tuning...
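For readers unfamiliar with the referenced recipe: self-play fine-tuning builds preference pairs by treating the SFT target as the chosen response and the current model's own generation as the rejected one, then training with a DPO-style loss. A rough sketch of the pair construction is below; the model name and generation settings are illustrative, not the exact setup from this report.

```python
# Rough sketch of building SPIN-style preference pairs: the ground-truth SFT target
# is "chosen", the current model's own sample is "rejected". Illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "vinai/PhoGPT-4B-Chat"  # the SFT starting point mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)  # custom model code may be needed
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def make_preference_pair(prompt: str, sft_target: str) -> dict:
    """Return one DPO training example with the model's own sample as the rejected response."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
    model_response = tokenizer.decode(
        generated[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return {"prompt": prompt, "chosen": sft_target, "rejected": model_response}
```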
Thank you for maintaining such an important repository. I really enjoyed and learned a lot from reading your DPO paper. I have one question regarding the SFT loss implementation in...
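For anyone following the same question: the SFT objective is simply the negative log-likelihood of the chosen response given the prompt. A minimal sketch (not the repository's exact implementation) is:

```python
# Minimal sketch of an SFT loss: average negative log-likelihood of the chosen
# response tokens, with prompt/padding positions masked out via labels == -100.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); labels: (batch, seq_len) with -100 on masked positions."""
    # Shift so the token at position t is predicted from positions < t.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```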
Hi, I am trying to run the SFT step using 4 A100 80GB GPUs, and it reports an error: `starting 4 processes for FSDP training setting RLIMIT_NOFILE soft limit to 1048576 from 1048576 /opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216:...
In your formula (the image below), it seems that log[π(y|x)] is calculated with .sum(-1) after logits.softmax(-1), then .log(). But in your code (the image below), log[π(y|x)] is...
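For other readers of this thread: both views compute the same quantity, since log π(y|x) = Σ_t log π(y_t | x, y_<t); the sum is over sequence positions after gathering each label's (log-)probability, never over the vocabulary dimension of the softmax. A sketch of the gather-and-sum form (not necessarily the repo's exact function) is:

```python
# Sketch of log π(y|x) = Σ_t log softmax(logits_t)[y_t], i.e. a sum of per-token
# log-probabilities over the response tokens selected by `mask`.
import torch

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); labels, mask: (batch, seq_len); returns (batch,)."""
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = mask[:, 1:]
    # Replace masked label positions with a valid index so gather is well-defined.
    labels = torch.where(mask.bool(), labels, torch.zeros_like(labels))
    per_token_logps = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (per_token_logps * mask).sum(-1)
```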
When I run the SFT script in the example by choosing `BasicTrainer` instead of `FSDPTrainer` and by disabling wandb logging to avoid other issues: `python -u train.py model=pythia28 datasets=[hh] loss=sft...
It seems that the IPO config file is missing here, which prevents IPO from running.
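While the config is missing, the IPO objective itself (Azar et al., 2023, Eq. 17) regresses the difference of policy/reference log-ratios toward 1/(2τ). A sketch of the loss, independent of this repo's config layout, is:

```python
# Sketch of the IPO loss: ((h_chosen_vs_rejected) - 1/(2*tau))^2, where h is the
# difference of policy/reference log-ratios. Illustration only, not repo code.
import torch

def ipo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    tau: float = 0.1,
) -> torch.Tensor:
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    h = pi_logratios - ref_logratios
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```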
In both trainers, Basic and FSDP, there is an underlying pattern of GPU memory not being freed: allocation keeps increasing in steps while utilization remains roughly constant. Does...
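One common cause of this symptom in PyTorch training loops generally (not a confirmed diagnosis of this repo) is storing graph-attached tensors across steps when accumulating metrics, which keeps activations alive. A hedged sketch of the usual mitigation:

```python
# General PyTorch pattern: detach metrics (or convert to Python floats) before
# storing them, so the autograd graph and its activations can be freed each step.
import torch

metrics_history = []

def log_step_metrics(loss: torch.Tensor, rewards: torch.Tensor) -> None:
    metrics_history.append({
        "loss": loss.detach().item(),           # drops the autograd graph
        "reward_mean": rewards.detach().mean().item(),
    })
    # Releases cached blocks so nvidia-smi reflects actual usage; it does not
    # reduce memory that live tensors still hold.
    torch.cuda.empty_cache()
```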
@eric-mitchell Will you be adding an implementation of the Plackett-Luce ranking model in addition to the current Bradley-Terry model? Looking forward to hearing from you!
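In case it helps other readers while this is open: the DPO paper's Plackett-Luce generalization scores a full ranking of K responses with the same β-scaled policy/reference log-ratios. A sketch of such a loss (not code from this repo) is below; for K = 2 it reduces to the familiar Bradley-Terry DPO loss.

```python
# Sketch of a Plackett-Luce DPO loss over a full ranking of K responses per prompt.
# `logratios` holds beta * (log pi(y|x) - log pi_ref(y|x)) for the K responses,
# already ordered from most- to least-preferred. Illustration only.
import torch

def plackett_luce_dpo_loss(logratios: torch.Tensor) -> torch.Tensor:
    """logratios: (batch, K), ordered best-to-worst; returns a scalar loss."""
    K = logratios.size(-1)
    loss = 0.0
    for k in range(K - 1):  # the final item's term is identically zero
        # -log P(item k is ranked above all remaining items k..K-1)
        loss = loss - (logratios[:, k] - torch.logsumexp(logratios[:, k:], dim=-1))
    return loss.mean()
```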
Hi, I have been trying to reproduce the win rate results from the paper for summarization and I'm struggling to get similar values. I wonder if you've experienced this as...