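DeepResearch GAIA reproduction issue

Running tongyi-30b on the GAIA validation set, I only get a score of 50. Are there places in the currently open-sourced code that might be problematic? For example, in tool_file it looks like only mp3 files are routed to VideoAgent(), which seems off to me.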

Oct 27 '25 09:10 glennccc

Could you provide more details? Others have already tried reproducing GAIA, and the results are fully reproducible. Reference issue: https://github.com/Alibaba-NLP/DeepResearch/issues/173

{ "overall": {"avg_pass_at_3": 77.35, "best_pass_at_1": 82.52, "pass_at_3": 91.26}, "individual": {"Round1_Pass@1": 82.52, "Round2_Pass@1": 75.73, "Round3_Pass@1": 75.25}, "statistics": {"extra_length": 27.0, "num_invalid": 3.667, "avg_action": 16.814, "avg_visit_action": 8.458, "avg_search_action": 7.561, "avg_other_action": 0.796, "avg_ans_length": 4664.149, "avg_think_length": 2655.721, "avg_tool_calls_per_question": 16.814, "avg_assistant_tokens_per_question": 6275.463, "avg_assistant_tokens_per_message": 352.953, "termination_freq": {"answer": 0.961, "generate an answer as token limit reached": 0.003, "format error: generate an answer as token limit reached": 0.033, "exceed available llm calls": 0.003}, "avg_tool_calls_per_question_correctly_solved": 13.594, "avg_assistant_tokens_per_question_correctly_solved": 3183.946}}

Oct 28 '25 04:10 likuanppd

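That issue only ran the text-only subset; I ran all 165 validation questions. Also, how much does the choice of summary model affect the final result? Do you have any experience with this? I used GLM4.5 for the summary.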

Oct 28 '25 06:10 glennccc

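@likuanppd One more question: the main loop does not cap the number of rounds. Could that be a problem? I hit several cases where the agent keeps repeating the same earlier actions, like an infinite loop. The round count grows very large, and in the end no answer is ever written to prediction.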
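To make the concern concrete, here is a minimal sketch of the kind of guard I have in mind. It is not the repository's code: `run_with_guards`, the `step` callback, and the `max_rounds` / `max_repeats` parameters are all hypothetical names, and the thresholds are arbitrary. The idea is just to stop when the round budget runs out or when the exact same tool call has been issued too many times.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

# Hypothetical representation of a tool call: (tool name, serialized arguments).
Action = Tuple[str, str]


def run_with_guards(
    step: Callable[[int], Tuple[Optional[Action], Optional[str]]],
    max_rounds: int = 100,
    max_repeats: int = 3,
) -> Tuple[Optional[str], str]:
    """Run `step` until it answers, repeats itself, or exhausts the budget.

    `step(round_idx)` is an assumed callback that returns (action, answer).
    A non-None answer ends the run; otherwise `action` is the tool call
    that was just made. Returns (answer, termination_reason).
    """
    seen: Counter = Counter()
    for round_idx in range(max_rounds):
        action, answer = step(round_idx)
        if answer is not None:
            return answer, "answer"
        if action is not None:
            seen[action] += 1
            # The same call issued max_repeats times is treated as a loop.
            if seen[action] >= max_repeats:
                return None, "repeated action"
    return None, "round limit reached"
```

With something like this, a stuck case would terminate with an explicit reason instead of running indefinitely, and the caller could still force a final answer from the context gathered so far.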

Oct 28 '25 06:10 glennccc

Can you find the 103-question text-only subset anywhere, or does it need to be filtered manually?
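For reference, a rough first pass I would try is to drop every question that comes with an attachment. This is only a sketch under assumptions: it assumes each record is a dict with a `file_name` field that is empty when there is no attached file, and `text_only` is a hypothetical helper. It will not necessarily reproduce the exact 103-question list, so the count should be checked against the published subset.

```python
from typing import Dict, Iterable, List


def text_only(
    records: Iterable[Dict[str, str]],
    file_field: str = "file_name",  # assumed name of the attachment field
) -> List[Dict[str, str]]:
    """Keep only records whose attachment field is missing or blank."""
    return [r for r in records if not (r.get(file_field) or "").strip()]


# Tiny made-up records, only to show the call shape.
sample = [
    {"task_id": "a", "file_name": ""},
    {"task_id": "b", "file_name": "audio.mp3"},
    {"task_id": "c"},
]
kept = text_only(sample)
```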

Nov 26 '25 02:11 qcname