
Question regarding ARMO stage2-train code

Open · RayWang-iat opened this issue · 0 comments

Thank you very much for open-sourcing such excellent work as ARMO. I am currently reproducing the stage2-train code. Starting from the data you provided, I made only two modifications: first, I replaced the preference data with Skywork/Skywork-Reward-Preference-80K-v0.2, and second, I replaced the reference data with the same dataset. The final training results are shown below. The results stay the same even if I adjust the training steps or learning rate, and there is a significant performance gap compared to the model you provided. Do you know what might be causing this?

Also, training with your code gives me a .pt file for the gating network. Could you please provide the merging code, so that the model I train ends up with the same structure as the RLHFlow/ArmoRM-Llama3-8B-v0.1 you released? Thank you very much!
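For reference, this is roughly what I imagine the merge step would need to do. This is only a minimal sketch: the `GatingNetwork` architecture, sizes, and file name below are my own placeholders, not taken from your repo.

```python
# Hypothetical sketch of merging a stage-2 gating network back into one
# checkpoint. Module names and sizes are placeholders, not from the repo.
import torch
import torch.nn as nn


class GatingNetwork(nn.Module):
    """Placeholder MLP gating head; the real ArmoRM architecture may differ."""

    def __init__(self, hidden_size: int, num_objectives: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_objectives),
            nn.Softmax(dim=-1),  # gating weights over the reward objectives
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


hidden_size, num_objectives = 16, 4  # toy sizes for illustration only

# Simulate the stage-2 artifact: a .pt file holding the gating state dict.
gating = GatingNetwork(hidden_size, num_objectives)
torch.save(gating.state_dict(), "gating_network.pt")

# "Merge": load the trained state dict and attach the module to the reward
# model, so that saving the model serializes backbone + gating together.
state = torch.load("gating_network.pt")
gating.load_state_dict(state)
```

If this is the right idea, I assume the remaining step is registering the loaded module on the full reward model before saving, so the checkpoint layout matches RLHFlow/ArmoRM-Llama3-8B-v0.1.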

```
Evaluating model...
Validation accuracy: 0.8965
Saved gating network to xxx/gating_network_FsfairX-LLaMA3-RM-v0.1_6k1.pt

Evaluating on RewardBench...
  df_acc = pd.concat([df_acc, pd.DataFrame(row)], ignore_index=True)
RewardBench Scores:
        Chat  Chat Hard     Safety  Reasoning  Score
0  99.162012  64.692981  89.099712  88.235938   85.3
```

RayWang-iat · Oct 15 '24 02:10