AllenShow comments

Results: 7 comments by AllenShow

Obtaining my own critic data

Yes! There is no README in process_data, even though one is mentioned in data_creation/critic/gpt4_reward/README.md.

About baseline's parameter 'task'

I have the same question about the parameter 'max_new_tokens': should it be set to the same value that Self-RAG uses for each particular task, so that the comparison is fair?

Build data for critic model

I also want to know the answer and a complete description of the data creation process.

[Bug] DeepSeek R1 32B model scores low on the AIME2024 dataset

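What is the difference between the dataset you used and aime2024_gen_6e39a4, which aime2024_gen points to by default when no version suffix is given? Which one should be used to evaluate chat models?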

[Feature] Are there currently any plans to support the datasets that appear in R1, such as Codeforces, SWE Verified, and Aider-Polyglot?

[Bug] The gpqa_gen dataset produces very low scores

> GPQA_gen points to gpqa_openai_simple_evals_gen_5aeece.py, which requires the model outputs “ANSWER: $LETTER”.

So if I want to evaluate chat models in a more general way, I can use gpqa_gen_4baadb instead, is that right?
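For anyone wondering why a strict output format can drag scores down, here is a minimal sketch of that kind of answer extraction. The pattern and the helper name `extract_letter` are my own assumptions for illustration, not the code used by the config quoted above; the point is only that a response which never writes the answer in the expected form gets no extracted letter and is scored as wrong.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; the real config may use a different one.
ANSWER_PATTERN = re.compile(r"ANSWER\s*:\s*([A-D])\b", re.IGNORECASE)


def extract_letter(response: str) -> Optional[str]:
    """Return the letter from the last 'ANSWER: X' in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    # Take the last match so a final answer overrides earlier drafts.
    return matches[-1].upper() if matches else None


if __name__ == "__main__":
    print(extract_letter("Some reasoning here.\nANSWER: C"))  # format followed
    print(extract_letter("I think the answer is C."))         # format not followed
```

A chat model that answers correctly in free-form prose would still return `None` here, which is one plausible reason for low scores with a format-strict config.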

[Bug] Take too much time on MATH-500 dataset evaluation

> Please check the prediction to find if there is a repeat pattern in response. And reduce the --max-out-len. Also, you can remove the --debug and use four workers. Could...
