Siyuan Zhu
lol
> > > Instead of using a linear predictor, GenRM leverages CoT and next-token prediction to produce the reward. GenRM is reported to be more accurate: https://arxiv.org/abs/2410.12832 > > > >...
Hi all, if I want to call a reward model inside the reward function to generate the reward, what's the current best practice? Thanks!
Thank you, that's of great help!
> If I want to use GRPO, can I use an RM? I think writing your own reward function and calling the RM inside it should work.
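For what it's worth, here is a rough sketch of what "call the RM inside the reward" could look like. Everything named here is an assumption for illustration: `make_rm_reward`, `score_fn`, and the `(prompts, completions, **kwargs)` signature are hypothetical and need adapting to whatever your trainer expects, and `dummy_score` is only a stand-in for a real reward model's forward pass.

```python
from typing import Callable, List, Sequence


def make_rm_reward(
    score_fn: Callable[[List[str]], Sequence[float]],
    batch_size: int = 8,
) -> Callable[..., List[float]]:
    """Wrap a reward-model scorer as a reward function.

    score_fn is a hypothetical callable: it takes a list of texts and
    returns one scalar score per text (e.g. an RM forward pass).
    """

    def reward_fn(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
        # Score prompt and completion together so the RM sees the full context.
        texts = [p + c for p, c in zip(prompts, completions)]
        rewards: List[float] = []
        # Call the RM in small batches to keep memory use bounded.
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            rewards.extend(float(s) for s in score_fn(batch))
        return rewards

    return reward_fn


def dummy_score(texts: List[str]) -> List[float]:
    # Placeholder scorer (NOT a real RM): longer text gets a higher score.
    return [len(t) / 100.0 for t in texts]


reward_fn = make_rm_reward(dummy_score, batch_size=2)
prompts = ["Q: 1+1? "] * 3
completions = ["2", "two", "It is 2."]
rewards = reward_fn(prompts, completions)
```

In a real setup you would replace `dummy_score` with something that tokenizes the batch and runs your RM, then pass `reward_fn` to the trainer in whatever form it accepts custom rewards.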