
The official implementation of Self-Play Preference Optimization (SPPO)

15 SPPO issues

I noticed that in [data-mistral-7b-instruct-sppo-iter1](https://huggingface.co/datasets/UCLA-AGI/data-mistral-7b-instruct-sppo-iter1) the `rm_scores` column is a list of length 7, whereas there are only 5 generated responses. data-mistral-7b-instruct-sppo-iter2 and data-mistral-7b-instruct-sppo-iter3 look correct and both have length = 5
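For anyone who wants to check this themselves, a minimal sketch is below; the split name `train` is an assumption and may differ in the released dataset.

```python
# Minimal sanity check of rm_scores lengths in the released dataset.
# The split name "train" is assumed; adjust if the dataset uses another split.
from datasets import load_dataset

ds = load_dataset("UCLA-AGI/data-mistral-7b-instruct-sppo-iter1", split="train")
lengths = {len(x) for x in ds["rm_scores"]}
print("distinct rm_scores lengths:", lengths)  # the issue reports 7 here, 5 in iter2/iter3
```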

I am trying to reproduce the Mistral-7B-SPPO Iter1 model. However, after my first iteration, the model I trained diverged significantly from the published Mistral-7B-SPPO Iter1 model when comparing the results...

https://github.com/uclaml/SPPO/blob/e524519cc87e9e48cd4da30588f7aa566638df4c/scripts/compute_prob.py#L39 From my understanding of the code, the score list here is the output of `blender.rank(*, return_scores=True)`, which should output the average relative score of the response in the...
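For context, the documented llm-blender usage of the PairRM ranker looks roughly like the sketch below; the example prompt and candidates are made up, and this is not the repository's `compute_prob.py` code.

```python
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # pairwise ranker used by SPPO

inputs = ["What is the capital of France?"]                       # made-up example
candidates = [["Paris.", "It might be Lyon.", "I am not sure."]]  # made-up example

# With return_scores=True, rank() returns a (num_inputs, num_candidates) array of
# relative scores per candidate instead of integer rankings.
scores = blender.rank(inputs, candidates, return_scores=True, batch_size=1)
print(scores)
```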

## Issue
Data generation requires exactly 8 GPUs to be present. This prevents the code from running properly on machines with fewer than 8 GPUs (for instance, I am using...

Hello authors, great work! I added a quick PR to adapt generation to run on fewer than 8 GPUs if needed: https://github.com/uclaml/SPPO/pull/24. This is a minimally invasive change.
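A minimal sketch of the general idea, sharding prompts across however many GPUs are visible instead of assuming exactly 8; the function below is illustrative and is not the repository's generation script or the PR's code.

```python
import torch

def shard_prompts(prompts, num_shards=None):
    """Split prompts into one contiguous chunk per available GPU (hypothetical helper)."""
    if num_shards is None:
        num_shards = max(torch.cuda.device_count(), 1)
    shard_size = (len(prompts) + num_shards - 1) // num_shards
    return [prompts[i:i + shard_size] for i in range(0, len(prompts), shard_size)]

shards = shard_prompts([f"prompt {i}" for i in range(100)])
print(len(shards), [len(s) for s in shards])
```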

Dear authors, may I know how we can train the iterative DPO baseline model using this repo? Is there a convenient way to modify the sppo code?
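One possible approach (not an official feature of this repo) is to convert each iteration's generation data into standard (chosen, rejected) pairs, picking the best and worst responses by `rm_scores`, and then train with an ordinary DPO trainer in place of the SPPO loss. In the sketch below, the column names other than `rm_scores` and the split name are assumptions.

```python
# Hypothetical sketch: build a DPO preference dataset from SPPO-style generation data.
from datasets import load_dataset

ds = load_dataset("UCLA-AGI/data-mistral-7b-instruct-sppo-iter1", split="train")

def to_dpo_pair(row, num_responses=5):
    responses = [row[f"generate_{i}"] for i in range(num_responses)]  # assumed column names
    scores = row["rm_scores"][:num_responses]
    best = max(range(num_responses), key=lambda i: scores[i])
    worst = min(range(num_responses), key=lambda i: scores[i])
    return {
        "prompt": row["prompt"],        # assumed column name
        "chosen": responses[best],
        "rejected": responses[worst],
    }

dpo_ds = ds.map(to_dpo_pair, remove_columns=ds.column_names)
```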

```
step 10: {'loss': 119743.8516, 'grad_norm': 938286.7284407256, 'learning_rate': 2.0161290322580643e-09, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -128.30323791503906, 'logps/chosen': -178.66146850585938, 'logits/rejected': -0.7681801915168762, 'logits/chosen': -0.792536735534668, 'epoch': 0.0}
step 20: {'loss':...
```

Hey guys! For anyone who is interested, I recently submitted a pull request that implements SPPO in the Axolotl trainer; you can follow the pull request here: https://github.com/axolotl-ai-cloud/axolotl/pull/1735 Original SPPO implementation fork:...

I found that the current repository configuration is not compatible with Gemma2. The reason might be that transformers and vllm are not fully compatible with Gemma2. Could you share the...

Hi, when I follow the default steps to set up the environment, `pip install vllm` will automatically install vllm 0.5.0.post1, which requires transformers>=4.40.0. When installing SPPO ( transformers==4.36.2 are...
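A quick way to confirm which versions actually got installed is the small helper below; it is not part of the repo, and the pins shown are just the ones mentioned in this issue.

```python
import importlib.metadata as md

# Pins below come from this issue's description, not from authoritative requirements.
for pkg, pinned in {"transformers": "4.36.2", "vllm": "0.5.0.post1"}.items():
    print(f"{pkg}: installed {md.version(pkg)}, issue mentions {pinned}")
```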