RLHF-Reward-Modeling
Update eval_bench_mark.py
Using len(names) instead of the hard-coded 13 makes it possible to run only part of the evaluation benchmark at a time. For machines that do not have much GPU memory, this can be helpful.
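A minimal sketch of the change being proposed: loop over however many subsets are actually configured (`len(names)`) rather than a hard-coded 13, so a shortened `names` list evaluates only those subsets. The `names` entries and `evaluate_subset` helper below are illustrative placeholders, not the real code from eval_bench_mark.py.

```python
def evaluate_subset(name: str) -> float:
    """Placeholder for the per-subset evaluation; returns a dummy score."""
    return 0.5

# Only two subsets configured, e.g. on a machine with limited GPU memory.
names = ["subset_a", "subset_b"]

# Before: for i in range(13)  -- assumed all 13 subsets were present.
# After: the loop bound follows the configured list.
scores = [evaluate_subset(names[i]) for i in range(len(names))]
print(len(scores))  # one score per configured subset
```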
Thanks for the PR! I am busy with some projects as finals approach... I will get back to you as soon as possible.
Hi, I just updated the evaluation script to support a weighted average over the different subsets. The current result now matches the official leaderboard. Could you update your pull request accordingly?
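For context, a weighted average over subsets means each subset's score contributes in proportion to its size instead of every subset counting equally. The subset names, scores, and sizes below are made-up illustrations, not the actual benchmark numbers:

```python
# Hypothetical per-subset accuracies and example counts.
subset_scores = {"subset_a": 0.90, "subset_b": 0.80}
subset_sizes = {"subset_a": 300, "subset_b": 100}

total = sum(subset_sizes.values())
# Weight each subset's score by its number of examples.
weighted_avg = sum(subset_scores[k] * subset_sizes[k] for k in subset_scores) / total
print(weighted_avg)  # (0.90*300 + 0.80*100) / 400 = 0.875
```

An unweighted mean of the same two scores would be 0.85, so the two aggregation methods can disagree noticeably when subset sizes differ.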