Online-RLHF

A recipe for online RLHF and online iterative DPO.

16 Online-RLHF issues

Hi, I tried to reproduce the training process from SFT to DPO. I ran the run_loop.sh script; the only change I made was setting initial_model="RLHFlow/LLaMA3-SFT". After 3 iterations, the final...

How do I train the SFT model on an RTX 4090?
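Not an official answer, but a single 24 GB RTX 4090 usually calls for parameter-efficient fine-tuning rather than full SFT. A minimal sketch with TRL's `SFTTrainer` plus LoRA and gradient checkpointing (the base model, dataset, and hyperparameters are illustrative, not the repo's recipe):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA trains only small adapter matrices, so optimizer state stays tiny;
# together with gradient checkpointing this is what makes an 8B-class SFT
# run fit on a single 24 GB card.
dataset = load_dataset("trl-lib/Capybara", split="train[:1000]")  # any chat dataset

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # assumed base (gated); swap in your own
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sft_out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,  # trade recompute for activation memory
        bf16=True,
    ),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

If even the bf16 base weights do not fit, loading the base model in 4-bit (QLoRA) is the usual next step.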

When I tried to reproduce the results in the RLHFlow paper, I encountered some errors. This happens when I run get_rewards.py on 8 A100s. [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught...
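One common trigger for this watchdog error is a collective call timing out while a slow rank is still scoring long responses. A hedged workaround (a sketch, not the repo's official fix) is to raise the NCCL timeout where the process group is initialized, inside the existing torchrun launch:

```python
import datetime

import torch.distributed as dist

# Hypothetical workaround: lengthen the collective timeout (the default is
# 30 minutes) so one GPU scoring a much longer shard does not trip the
# ProcessGroupNCCL watchdog.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```

If the hang persists at the same point, it is worth checking that every rank reaches the same collectives in the same order.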

Hi. The paper mentions that the offline vanilla DPO baseline is trained on the Nectar dataset. I have several questions about that. 1. How do you process the Nectar dataset? Nectar...
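For reference, one plausible preprocessing is to turn Nectar's ranked answers per prompt into best-vs-worst preference pairs. A minimal sketch, assuming the `prompt`, `answers`, and `rank` fields from the public berkeley-nest/Nectar dataset card (the pairing scheme itself is an assumption, not necessarily the authors'):

```python
from datasets import load_dataset

ds = load_dataset("berkeley-nest/Nectar", split="train")

def to_pair(example):
    # Each row carries a list of answers annotated with a rank (1 = best);
    # sort by rank and pair the best and worst as chosen/rejected.
    ranked = sorted(example["answers"], key=lambda a: a["rank"])
    return {
        "prompt": example["prompt"],
        "chosen": ranked[0]["answer"],
        "rejected": ranked[-1]["answer"],
    }

pairs = ds.map(to_pair, remove_columns=ds.column_names)
```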

Dear authors, 1. I noticed that the reference policy is fixed as the initial policy instead of being updated to the previous iteration's policy. May I know the reason for it...
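For context, in the standard DPO objective the reference policy appears only inside the log-ratio terms,

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],$$

so fixing $\pi_{\mathrm{ref}} = \pi_0$ keeps every iteration regularized toward the same SFT anchor, whereas updating the reference each round lets the policy drift progressively further from it. This is a summary of the standard loss, not a claim about the authors' exact motivation.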

### Issue: Implementing Iterative DPO on Phi3-4k-instruct Hi, thanks for the great work and for open-sourcing it! I am trying to implement iterative DPO on `Phi3-4k-instruct`. The following outlines my approach:...
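For anyone adapting the recipe to another base model, the per-iteration generation step is just sampling several candidates per prompt from the current policy before reward ranking. A minimal sketch with plain transformers, assuming `microsoft/Phi-3-mini-4k-instruct` is the intended checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sample 8 candidate responses for one prompt from the current policy;
# a reward or preference model then picks chosen/rejected pairs from these.
model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed HF id for Phi3-4k-instruct
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain KL regularization in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,           # stochastic decoding so candidates differ
    temperature=1.0,
    max_new_tokens=256,
    num_return_sequences=8,   # candidates per prompt
    pad_token_id=tok.eos_token_id,
)
candidates = tok.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```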

Hi, congratulations on the great work, and thanks for open-sourcing it! I am running step 3.2 with pair-preference-model-LLaMA3-8B. However, I encountered the warning "Some weights of LlamaForSequenceClassification were not initialized...
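For what it's worth, that warning is expected whenever a causal-LM checkpoint is loaded through a sequence-classification class: the classification head does not exist in the checkpoint, so it gets randomly initialized. A hedged sketch of causal-LM usage, where the model scores a preference token (the exact prompt template should be taken from the model card; the one below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/pair-preference-model-LLaMA3-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder template: present both responses and ask for "A" or "B".
prompt = (
    "Question: ...\n"
    "Response A: ...\n"
    "Response B: ...\n"
    "Which response is better? Answer A or B: "
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]

# Compare the next-token logits for "A" vs. "B" as the preference signal.
token_a = tok.encode("A", add_special_tokens=False)[0]
token_b = tok.encode("B", add_special_tokens=False)[0]
prob_a = torch.softmax(logits[[token_a, token_b]], dim=0)[0]
```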

I have some questions about the iterative pipeline. Please correct me if my understanding is wrong; thank you so much! From the report, \pi_0 should be the SFT policy trained...

Hello! When I serve ArmoRM-Llama3-8B-v0.1 using OpenRLHF, the output rewards are almost all negative (around -2.0). I've attached some pictures of how I served the reward model. Is the output of...
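As a sanity check, it may help to score the same conversation directly with transformers and compare against the served values. A minimal sketch following the usage shown on the ArmoRM model card (the `score` field comes from its remote code):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
input_ids = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(input_ids)

# `out.score` is the aggregated preference score; if the served numbers
# differ systematically (e.g., always around -2.0), the server may be
# reading a different head, such as raw logits, instead of this field.
print(out.score.float())
```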

When running the training code in the rlhflow environment, I encountered a TypeError with the message: DPOTrainer.__init__() got an unexpected keyword argument 'beta'. It seems like there might be an...
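This particular TypeError usually points to a TRL version mismatch: in recent TRL releases, `beta` moved from the `DPOTrainer` constructor into `DPOConfig`. A sketch of the newer calling convention on a tiny placeholder model (versions and hyperparameters are assumptions; pinning TRL to the version in the repo's requirements is the alternative fix):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "sshleifer/tiny-gpt2"  # tiny placeholder just to show the API
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

pairs = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["4."],
    "rejected": ["5."],
})

# In newer TRL, `beta` is a DPOConfig field, not a DPOTrainer argument.
args = DPOConfig(output_dir="dpo_out", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions call this `tokenizer`
)
trainer.train()
```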