Online-RLHF

A recipe for online RLHF and online iterative DPO.

16 Online-RLHF issues

Hi, I tried to reproduce the training process from SFT to DPO. I ran the run_loop.sh script; the only change I made was setting initial_model="RLHFlow/LLaMA3-SFT". After 3 iterations, the final...

How do I train the SFT model on an RTX 4090?
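Not an official answer, but a single 24 GB RTX 4090 usually calls for parameter-efficient fine-tuning rather than full SFT. A minimal sketch with TRL's `SFTTrainer` plus LoRA and gradient checkpointing (the base model, dataset, and hyperparameters are illustrative, not the repo's recipe):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA trains only small adapter matrices, so optimizer state stays tiny;
# together with gradient checkpointing this is what makes an 8B-class SFT
# run fit on a single 24 GB card.
dataset = load_dataset("trl-lib/Capybara", split="train[:1000]")  # any chat dataset

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # assumed base (gated); swap in your own
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sft_out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        gradient_checkpointing=True,  # trade recompute for activation memory
        bf16=True,
    ),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

If even the bf16 base weights do not fit, loading the base model in 4-bit (QLoRA) is the usual next step.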

When I tried to reproduce the results in the RLHFlow paper, I encountered some errors. This happens when I run get_rewards.py on 8 A100s. [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught...
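One common trigger for this watchdog error is a collective call timing out while a slow rank is still scoring long responses. A hedged workaround (a sketch, not the repo's official fix) is to raise the NCCL timeout where the process group is initialized, inside the existing torchrun launch:

```python
import datetime

import torch.distributed as dist

# Hypothetical workaround: lengthen the collective timeout (the default is
# 30 minutes) so one GPU scoring a much longer shard does not trip the
# ProcessGroupNCCL watchdog.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```

If the hang persists at the same point, it is worth checking that every rank reaches the same collectives in the same order.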

Hi. The paper mentions that the offline vanilla DPO baseline is trained on the Nectar dataset. I have several questions about that. 1. How do you process the Nectar dataset? Nectar...
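For reference, one plausible preprocessing is to turn Nectar's ranked answers per prompt into best-vs-worst preference pairs. A minimal sketch, assuming the `prompt`, `answers`, and `rank` fields from the public berkeley-nest/Nectar dataset card (the pairing scheme itself is an assumption, not necessarily the authors'):

```python
from datasets import load_dataset

ds = load_dataset("berkeley-nest/Nectar", split="train")

def to_pair(example):
    # Each row carries a list of answers annotated with a rank (1 = best);
    # sort by rank and pair the best and worst as chosen/rejected.
    ranked = sorted(example["answers"], key=lambda a: a["rank"])
    return {
        "prompt": example["prompt"],
        "chosen": ranked[0]["answer"],
        "rejected": ranked[-1]["answer"],
    }

pairs = ds.map(to_pair, remove_columns=ds.column_names)
```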

Dear authors, 1. I noticed that the reference policy is fixed as the initial policy instead of being updated to the previous iteration's policy. May I know the reason for it...
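For context, in the standard DPO objective the reference policy appears only inside the log-ratio terms,

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],$$

so fixing $\pi_{\mathrm{ref}} = \pi_0$ keeps every iteration regularized toward the same SFT anchor, whereas updating the reference each round lets the policy drift progressively further from it. This is a summary of the standard loss, not a claim about the authors' exact motivation.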

### Issue: Implementing Iterative DPO on Phi3-4k-instruct Hi, thanks for the great work and for open-sourcing it! I am trying to implement iterative DPO on `Phi3-4k-instruct`. The following outlines my approach:...
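For anyone adapting the recipe to another base model, the per-iteration generation step is just sampling several candidates per prompt from the current policy before reward ranking. A minimal sketch with plain transformers, assuming `microsoft/Phi-3-mini-4k-instruct` is the intended checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sample 8 candidate responses for one prompt from the current policy;
# a reward or preference model then picks chosen/rejected pairs from these.
model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed HF id for Phi3-4k-instruct
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain KL regularization in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,           # stochastic decoding so candidates differ
    temperature=1.0,
    max_new_tokens=256,
    num_return_sequences=8,   # candidates per prompt
    pad_token_id=tok.eos_token_id,
)
candidates = tok.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```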

Hi, congratulations on the great work, and thanks for open-sourcing it! I am running step 3.2 with pair-preference-model-LLaMA3-8B. However, I encountered the warning "Some weights of LlamaForSequenceClassification were not initialized...
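For what it's worth, that warning is expected whenever a causal-LM checkpoint is loaded through a sequence-classification class: the classification head does not exist in the checkpoint, so it gets randomly initialized. A hedged sketch of causal-LM usage, where the model scores a preference token (the exact prompt template should be taken from the model card; the one below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/pair-preference-model-LLaMA3-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder template: present both responses and ask for "A" or "B".
prompt = (
    "Question: ...\n"
    "Response A: ...\n"
    "Response B: ...\n"
    "Which response is better? Answer A or B: "
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]

# Compare the next-token logits for "A" vs. "B" as the preference signal.
token_a = tok.encode("A", add_special_tokens=False)[0]
token_b = tok.encode("B", add_special_tokens=False)[0]
prob_a = torch.softmax(logits[[token_a, token_b]], dim=0)[0]
```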

I have some questions about the iterative pipeline. Please correct me if my understanding is wrong; thank you so much! From the report, \pi_0 should be the SFT policy trained...

Hello! When I serve ArmoRM-Llama3-8B-v0.1 using OpenRLHF, the output rewards are almost all negative (around -2.0). I've attached some pictures of how I served the reward model. Is the output of...
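As a sanity check, it may help to score the same conversation directly with transformers and compare against the served values. A minimal sketch following the usage shown on the ArmoRM model card (the `score` field comes from its remote code):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
input_ids = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(input_ids)

# `out.score` is the aggregated preference score; if the served numbers
# differ systematically (e.g., always around -2.0), the server may be
# reading a different head, such as raw logits, instead of this field.
print(out.score.float())
```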

When running the training code in the rlhflow environment, I encountered a TypeError with the message: DPOTrainer.__init__() got an unexpected keyword argument 'beta'. It seems like there might be an...
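This particular TypeError usually points to a TRL version mismatch: in recent TRL releases, `beta` moved from the `DPOTrainer` constructor into `DPOConfig`. A sketch of the newer calling convention on a tiny placeholder model (versions and hyperparameters are assumptions; pinning TRL to the version in the repo's requirements is the alternative fix):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "sshleifer/tiny-gpt2"  # tiny placeholder just to show the API
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

pairs = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["4."],
    "rejected": ["5."],
})

# In newer TRL, `beta` is a DPOConfig field, not a DPOTrainer argument.
args = DPOConfig(output_dir="dpo_out", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions call this `tokenizer`
)
trainer.train()
```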