direct-preference-optimization
Reference implementation for DPO (Direct Preference Optimization)
Thanks for putting this together. I am wondering how evals are done on trained models. Are there third-party evaluation libraries that you use to measure trained model performance/metrics, or...
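For context while this question is open: the DPO paper reports pairwise win rates judged against a baseline. A minimal sketch of such an eval is below; the judge is left as a hypothetical callable (e.g. a GPT-4 prompt), and none of this is the repo's actual evaluation code.

```python
# Minimal sketch of a pairwise win-rate evaluation.
# `judge_prefers_a` is a hypothetical callable (e.g. backed by a GPT-4 judge prompt)
# that returns True when response A is preferred over response B for the given prompt.
from typing import Callable, Sequence

def win_rate(
    prompts: Sequence[str],
    policy_responses: Sequence[str],
    baseline_responses: Sequence[str],
    judge_prefers_a: Callable[[str, str, str], bool],
) -> float:
    """Fraction of prompts on which the judge prefers the policy response over the baseline."""
    wins = 0
    for prompt, policy_out, baseline_out in zip(prompts, policy_responses, baseline_responses):
        if judge_prefers_a(prompt, policy_out, baseline_out):
            wins += 1
    return wins / len(prompts)
```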
Recently, I have experimented with DPO training for Vietnamese. I started with a strong SFT model, [vinai/PhoGPT-4B-Chat](https://huggingface.co/vinai/PhoGPT-4B-Chat), and followed the method described in [Chen, Zixiang, et al., Self-Play Fine-Tuning...
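For readers unfamiliar with the referenced recipe: self-play fine-tuning builds preference pairs by treating the SFT target as the chosen response and the current model's own generation as the rejected one, then training with a DPO-style loss. A rough sketch of the pair construction is below; the model name and generation settings are illustrative, not the exact setup from this report.

```python
# Rough sketch of building SPIN-style preference pairs: the ground-truth SFT target
# is "chosen", the current model's own sample is "rejected". Illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "vinai/PhoGPT-4B-Chat"  # the SFT starting point mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)  # custom model code may be needed
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def make_preference_pair(prompt: str, sft_target: str) -> dict:
    """Return one DPO training example with the model's own sample as the rejected response."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
    model_response = tokenizer.decode(
        generated[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return {"prompt": prompt, "chosen": sft_target, "rejected": model_response}
```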
Thank you for maintaining such an important repository. I really enjoyed and learned a lot from reading your DPO paper. I have one question regarding the SFT loss implementation in...
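For anyone following the same question: the SFT objective is simply the negative log-likelihood of the chosen response given the prompt. A minimal sketch (not the repository's exact implementation) is:

```python
# Minimal sketch of an SFT loss: average negative log-likelihood of the chosen
# response tokens, with prompt/padding positions masked out via labels == -100.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); labels: (batch, seq_len) with -100 on masked positions."""
    # Shift so the token at position t is predicted from positions < t.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```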
Hi, I am trying to run the SFT step using 4 A100 80GB GPUs, and it reports an error: `starting 4 processes for FSDP training setting RLIMIT_NOFILE soft limit to 1048576 from 1048576 /opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216:...
In your formula (the image below), it seems that log[π(y|x)] is calculated with .sum(-1) after logits.softmax(-1), then .log(). But in your code (the image below), log[π(y|x)] is...
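For other readers of this thread: both views compute the same quantity, since log π(y|x) = Σ_t log π(y_t | x, y_<t); the sum is over sequence positions after gathering each label's (log-)probability, never over the vocabulary dimension of the softmax. A sketch of the gather-and-sum form (not necessarily the repo's exact function) is:

```python
# Sketch of log π(y|x) = Σ_t log softmax(logits_t)[y_t], i.e. a sum of per-token
# log-probabilities over the response tokens selected by `mask`.
import torch

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); labels, mask: (batch, seq_len); returns (batch,)."""
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = mask[:, 1:]
    # Replace masked label positions with a valid index so gather is well-defined.
    labels = torch.where(mask.bool(), labels, torch.zeros_like(labels))
    per_token_logps = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (per_token_logps * mask).sum(-1)
```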
When I run the SFT script in the example by choosing `BasicTrainer` instead of `FSDPTrainer` and by disabling wandb logging to avoid other issues: `python -u train.py model=pythia28 datasets=[hh] loss=sft...
It seems that the IPO config file is missing here, which prevents IPO from running.
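While the config is missing, the IPO objective itself (Azar et al., 2023, Eq. 17) regresses the difference of policy/reference log-ratios toward 1/(2τ). A sketch of the loss, independent of this repo's config layout, is:

```python
# Sketch of the IPO loss: ((h_chosen_vs_rejected) - 1/(2*tau))^2, where h is the
# difference of policy/reference log-ratios. Illustration only, not repo code.
import torch

def ipo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    tau: float = 0.1,
) -> torch.Tensor:
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    h = pi_logratios - ref_logratios
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```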
In both trainers, Basic and FSDP, there is an underlying pattern of GPU memory not being freed: allocation keeps increasing in steps while utilization remains roughly constant. Does...
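One common cause of this symptom in PyTorch training loops generally (not a confirmed diagnosis of this repo) is storing graph-attached tensors across steps when accumulating metrics, which keeps activations alive. A hedged sketch of the usual mitigation:

```python
# General PyTorch pattern: detach metrics (or convert to Python floats) before
# storing them, so the autograd graph and its activations can be freed each step.
import torch

metrics_history = []

def log_step_metrics(loss: torch.Tensor, rewards: torch.Tensor) -> None:
    metrics_history.append({
        "loss": loss.detach().item(),           # drops the autograd graph
        "reward_mean": rewards.detach().mean().item(),
    })
    # Releases cached blocks so nvidia-smi reflects actual usage; it does not
    # reduce memory that live tensors still hold.
    torch.cuda.empty_cache()
```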
@eric-mitchell Will you be adding an implementation of the Plackett-Luce ranking model in addition to the current Bradley-Terry model? Looking forward to hearing from you!
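In case it helps other readers while this is open: the DPO paper's Plackett-Luce generalization scores a full ranking of K responses with the same β-scaled policy/reference log-ratios. A sketch of such a loss (not code from this repo) is below; for K = 2 it reduces to the familiar Bradley-Terry DPO loss.

```python
# Sketch of a Plackett-Luce DPO loss over a full ranking of K responses per prompt.
# `logratios` holds beta * (log pi(y|x) - log pi_ref(y|x)) for the K responses,
# already ordered from most- to least-preferred. Illustration only.
import torch

def plackett_luce_dpo_loss(logratios: torch.Tensor) -> torch.Tensor:
    """logratios: (batch, K), ordered best-to-worst; returns a scalar loss."""
    K = logratios.size(-1)
    loss = 0.0
    for k in range(K - 1):  # the final item's term is identically zero
        # -log P(item k is ranked above all remaining items k..K-1)
        loss = loss - (logratios[:, k] - torch.logsumexp(logratios[:, k:], dim=-1))
    return loss.mean()
```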
Hi, I have been trying to reproduce the win rate results from the paper for summarization and I'm struggling to get similar values. I wonder if you've experienced this as...