Question regarding ArmoRM stage-2 training code
Thank you very much for open-sourcing such excellent work as ArmoRM. I am currently reproducing the stage-2 training code. Starting from the data you provided, I made only two modifications: first, I replaced the preference data with Skywork/Skywork-Reward-Preference-80K-v0.2, and second, I replaced the reference data with the same dataset. The final training results are shown in the log below. The results stay essentially the same even when I adjust the number of training steps or the learning rate, and there is a significant performance gap compared to the model you released. Do you know what might be causing this?
Also, training with your code gives me a .pt file for the gating network. Could you please provide merging code so that the model I train ends up with the same structure as the RLHFlow/ArmoRM-Llama3-8B-v0.1 checkpoint you released? Thank you very much!
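In case it clarifies what I am asking for, here is a minimal sketch of the kind of merge script I have in mind: load your released checkpoint as a structural template, replace its gating network with my trained weights, and save everything as one standalone directory. The `model.gating` attribute name, the output paths, and the assumption that the .pt file holds a plain state_dict are all my guesses, so please correct anything that does not match your code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE = "RLHFlow/ArmoRM-Llama3-8B-v0.1"   # released checkpoint, used as a structural template
GATING_PT = "path/to/gating_network_FsfairX-LLaMA3-RM-v0.1_6k1.pt"  # stage-2 output (path illustrative)
OUT_DIR = "./ArmoRM-merged"              # hypothetical output directory

# Load the released model so the merged checkpoint keeps the same architecture.
model = AutoModelForSequenceClassification.from_pretrained(
    BASE, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Overwrite the gating network with the locally trained weights.
# NOTE: `model.gating` is my guess at the submodule name; please check the
# custom modeling file for the actual attribute. I also assume the .pt file
# stores a state_dict; if it stores the whole module, call .state_dict()
# on the loaded object first.
state_dict = torch.load(GATING_PT, map_location="cpu")
model.gating.load_state_dict(state_dict)

# Save a standalone directory with the same structure as the released model.
model.save_pretrained(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)
```

If the released repository already contains a script that does this, a pointer to it would be just as helpful.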
```
Evaluating model...
Validation accuracy: 0.8965
Saved gating network to xxx/gating_network_FsfairX-LLaMA3-RM-v0.1_6k1.pt
Evaluating on RewardBench...
df_acc = pd.concat([df_acc, pd.DataFrame(row)], ignore_index=True)
RewardBench Scores:
        Chat  Chat Hard     Safety  Reasoning  Score
0  99.162012  64.692981  89.099712  88.235938   85.3
```