RLHF-Reward-Modeling: Training and evaluating for pair_pm model

Hi,

I have replicated the training and evaluation for the pair_pm model, but I could not reproduce the results reported in Table 2 of the paper. The best results I obtained were with pm_models/llama3-8b-it_bs128_lr1e-5/checkpoint-1306:

Chat: 63.55, Chat Hard: 63.27, Safety: 82.59, Reasoning: 53.53

The main difference I've noticed in your script is that the base_model in your pair_pm/llama3-8b-it.yaml is /home/wx/axtool/models/llama3_it_with_padding_token. However, I couldn't find this model on Hugging Face or anywhere else. Therefore, I trained the pair_pm with meta-llama/Meta-Llama-3-8B-Instruct.

Another difference is in eval_reward_bench_pm.py. Similarly, you are using /home/cyeab/axtool/models/llama3_it_427_update for tokenizer and tokenizer_plain, while I used meta-llama/Meta-Llama-3-8B-Instruct instead.

Could you please share the llama3_it_with_padding_token and llama3_it_427_update models with me? Additionally, could you provide details on how you trained them?

Thank you!

Jul 09 '24 17:07 t-sifanwu

I think llama3_it_with_padding_token is obtained by adding a pad token to the original Llama 3 model. This can be done by running the pair-pm/prepare_model.py script. I did so and the resulting model is as expected.

Axolotl masks some tokens and stops their gradients, so I think the model's padding token has to be set appropriately to get the expected performance.
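
For reference, adding a pad token with transformers usually looks something like the sketch below. This is not the repo's prepare_model.py; the "[PAD]" string, the function names, and the output directory in the commented example are my own assumptions for illustration.

```python
def add_padding_token(tokenizer, model, pad_token="[PAD]"):
    """Register a dedicated pad token and keep the model in sync with it.

    The "[PAD]" string is an assumed choice, not taken from the repo.
    """
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": pad_token})
        # Grow the embedding matrix so the new token id has its own row.
        model.resize_token_embeddings(len(tokenizer))
    # Make the model config agree with the tokenizer's pad id.
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model


def prepare(base_name, out_dir):
    """Load a base model, add the pad token, and save both to out_dir."""
    # Imported lazily so add_padding_token can be reused without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_name)
    model = AutoModelForCausalLM.from_pretrained(base_name)
    add_padding_token(tokenizer, model)
    tokenizer.save_pretrained(out_dir)
    model.save_pretrained(out_dir)


# Example (downloads the gated base model; the output path is hypothetical):
# prepare("meta-llama/Meta-Llama-3-8B-Instruct", "./llama3_it_with_padding_token")
```

The saved directory can then be used as base_model in the yaml and as the tokenizer path in the evaluation script, so training and evaluation see the same pad token.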

Jul 09 '24 18:07 WayXG

Thanks for your reply! I have another question about training the bradley-terry-rm models. In bradley-terry-rm/llama3_rm.py, you use the dataset "hendrydong/preference_700K". Is that the same as the mix2 you mentioned in the paper?

Jul 10 '24 17:07 t-sifanwu

Yes, you can use hendrydong/preference_700K and the script we provide to process it into the format used by the pairwise preference dataset!
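
To give a rough idea of what such a conversion involves, here is a minimal sketch. It is only an illustration, not the script from the repo: the "[CONTEXT] / [RESPONSE A] / [RESPONSE B]" template, the output field names, and the random ordering of the two responses are assumptions.

```python
import random


def to_pairwise_example(context, chosen, rejected, rng):
    """Turn one (chosen, rejected) pair into a prompt plus an "A"/"B" label.

    The template and field names are assumed for illustration.
    """
    # Randomize which slot holds the chosen response so that the label
    # cannot be predicted from position alone.
    chosen_first = rng.random() < 0.5
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    prompt = f"[CONTEXT] {context} [RESPONSE A] {a} [RESPONSE B] {b}"
    return {"prompt": prompt, "label": "A" if chosen_first else "B"}


# Toy record, not taken from the dataset.
rng = random.Random(0)
example = to_pairwise_example("What is 2 + 2?", "4", "5", rng)
print(example["prompt"], "->", example["label"])
```

The label always names the slot that holds the chosen response, whichever order the two responses end up in.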

Jul 11 '24 02:07 WeiXiongUST

Thanks for your reply! The provided data-processing script takes input that is already in the standard format. Would it be possible to share the script that extracts the pairs? For example, the script that transforms the original UltraFeedback 63k dataset into the 340k standard-format RLHF dataset.

Jul 11 '24 17:07 t-sifanwu

Hi, you can check the datasets we provide in the RLHFlow organization on Hugging Face. The dataset card for each dataset includes the script used to build it.

Jul 12 '24 02:07 WeiXiongUST
