Adding SimPO to TRL
Hello,
This PR adds an implementation of SimPO to TRL.
The official codebase: https://github.com/princeton-nlp/SimPO
Thank you! Yu
@yumeng5 Hi! SimPO has been implemented in `CPOTrainer` but not yet released.
Hi @yumeng5, we have added SimPO to this PR: https://github.com/huggingface/trl/pull/1703. What do you think? We are also happy to incorporate some of the documentation you have in this PR if #1703 is a good implementation of SimPO.
@vwxyzjn Hi! Thank you for reaching out. We have carefully considered the integration of CPO and SimPO and have a few concerns:
Hyperparameter Differences: Although CPO and SimPO share some hyperparameters, the optimal values can vary significantly. For example, `beta` tends to be much larger in SimPO compared to other policy optimization algorithms.
Additional Hyperparameter for SimPO: SimPO introduces an extra hyperparameter, the `gamma_beta` ratio, which plays a crucial role in determining the appropriate target reward margin. Unlike CPO, which implements this as a constant `gamma`, we find that tuning the `gamma_beta` ratio in SimPO is more effective for reaching an optimal point.
Our experiments show that the optimal values for these hyperparameters, as well as the learning rate, can differ significantly from those used in DPO, SLiC-HF, or CPO, despite the similarities in their formulations. Therefore, we think it's better to develop a separate Trainer for SimPO to better highlight its unique requirements for users. What do you think?
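For context, here is a minimal sketch of the SimPO objective from the paper, which illustrates how these hyperparameters interact. The function and argument names are illustrative, not the PR's API; `gamma` is recovered as `gamma_beta_ratio * beta`:

```python
import torch.nn.functional as F

def simpo_loss(avg_chosen_logps, avg_rejected_logps, beta=2.0, gamma_beta_ratio=0.5):
    # SimPO's implicit reward is the length-normalized (average per-token)
    # log probability scaled by beta; there is no reference model, unlike DPO.
    # The target reward margin gamma is tuned via the gamma/beta ratio.
    gamma = gamma_beta_ratio * beta
    logits = beta * (avg_chosen_logps - avg_rejected_logps) - gamma
    return -F.logsigmoid(logits).mean()
```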
Hi @xiamengzhou, these are excellent reasons, and they make sense. However, I do have reservations about bloating the codebase from a maintenance point of view.
Maybe there are two possibilities going forward:

1. Better documentation and examples: for example, you could add a `simpo.py` in the `examples/scripts` folder and, by default, set the hyperparameters that make sense for SimPO, like here (https://github.com/yumeng5/trl/blob/223ce737d651a78a9b54ee1a4472fc4e4eb61760/examples/scripts/simpo.py#L19-L32). Most people use the examples when playing with the trainers anyway, so you can ensure that people use the right `beta` and so on. Then, in the documentation, you can explain that SimPO uses different optimal hyperparameters and recommend them to users.
2. Do a very light wrapper of `CPOTrainer`. For example, instead of inheriting from `TrainingArguments`, maybe inherit from `CPOConfig`, and name it `SimPOConfig`. That way you can override the default `beta` value. Similarly, you can inherit from `CPOTrainer` (see the sketch after this list).
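To make option 2 concrete, here is a rough sketch, assuming `CPOConfig` exposes a `beta` field and that a SimPO-specific margin field (named `simpo_gamma` here purely for illustration) can be added; the default values shown are placeholders, and the actual recommendations should come from the SimPO paper or official repo:

```python
from dataclasses import dataclass
from trl import CPOConfig, CPOTrainer

@dataclass
class SimPOConfig(CPOConfig):
    # Override CPO defaults with values that make sense for SimPO.
    # (Placeholder values; see the SimPO paper for tuned settings.)
    beta: float = 2.0          # SimPO typically uses a much larger beta
    simpo_gamma: float = 1.0   # target reward margin (illustrative field name)

class SimPOTrainer(CPOTrainer):
    """Thin wrapper so users get SimPO-appropriate defaults out of the box."""
    pass
```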
@vwxyzjn Thanks for the response! Your suggestions make sense to me and we will revise the PR accordingly :) Will ping you here once we are done.
Hi @vwxyzjn
Thanks again for your suggestions. We have modified the PR by doing a wrapper of the CPOTrainer. Would you mind taking a look and letting us know if that looks good to you?
Thanks!
Hi @qgallouedec
Thanks again for your help with reviewing our PR! Let me know if any further revision is needed!
Thanks, Yu
Hey @yumeng5 I made some new remarks, and there are still the previous ones to address. Don't hesitate to ping me if you need help.
Hi @qgallouedec
Thanks a lot for your comprehensive review! Following your suggestion, I'm updating the SimPO trainer to be a complete, standalone one, which also follows the current setup of the other TRL trainers. I believe I have committed all the changes you suggested previously. Let me know if they look good to you!
Thanks, Yu
Hi @qgallouedec @lewtun
I noticed that the PR has remained open for a while -- just wanted to check in and see if there's anything I can do to help! Thanks again for your efforts!
Best, Yu
Hi @yumeng5 thanks for adding your implementation to TRL! I just have a few quick questions.
- The existing implementation uses `CPOTrainer` for training SimPO models. However, using this approach, I was not able to reproduce the model checkpoints or metrics linked in your Princeton GitHub repo: https://github.com/princeton-nlp/SimPO/tree/main. Even though I used the same hyperparameters and dataset, the trained models are much worse, e.g., they frequently produce gibberish. I'm wondering if this PR is a better reflection of your original work.
- In your original implementation, there is a special preprocessing step that keeps only the model/assistant message in the `chosen`/`rejected` field: https://github.com/princeton-nlp/SimPO/blob/main/scripts/run_simpo.py#L89-L99. I also see this is the case for this PR. So I'm wondering whether users need to provide their own preprocessing given a dataset where the `chosen` and `rejected` fields contain both user (i.e., prompt) and assistant messages (e.g., https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized?row=0)?
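For illustration, a minimal sketch of that kind of preprocessing, assuming the `chosen`/`rejected` columns hold lists of chat messages ordered so the assistant reply comes last (as in the dataset linked above); the helper name is hypothetical, not the PR's code:

```python
def keep_last_assistant_message(example):
    # Keep only the final (assistant) turn in chosen/rejected, mirroring
    # the spirit of the preprocessing in the official SimPO run script.
    example["chosen"] = example["chosen"][-1:]
    example["rejected"] = example["rejected"][-1:]
    return example

# dataset = dataset.map(keep_last_assistant_message)
```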
I'm closing this PR due to inactivity. Please feel free to reopen it or request a reopening if necessary.