GAIA reproduction problem

Open glennccc opened this issue 4 months ago • 4 comments

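With tongyi-30b I only reach a score of 50 on the GAIA validation set. Are there places in the currently open-sourced code that might be causing this? For example, in tool_file I see that only mp3 files are routed to VideoAgent(), which looks wrong to me.

glennccc, Oct 27 '25 09:10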

Could you provide more details? Others have already tried reproducing GAIA, and the results are fully reproducible. Reference issue: https://github.com/Alibaba-NLP/DeepResearch/issues/173

{ "overall": {"avg_pass_at_3": 77.35, "best_pass_at_1": 82.52, "pass_at_3": 91.26}, "individual": {"Round1_Pass@1": 82.52, "Round2_Pass@1": 75.73, "Round3_Pass@1": 75.25}, "statistics": {"extra_length": 27.0, "num_invalid": 3.667, "avg_action": 16.814, "avg_visit_action": 8.458, "avg_search_action": 7.561, "avg_other_action": 0.796, "avg_ans_length": 4664.149, "avg_think_length": 2655.721, "avg_tool_calls_per_question": 16.814, "avg_assistant_tokens_per_question": 6275.463, "avg_assistant_tokens_per_message": 352.953, "termination_freq": {"answer": 0.961, "generate an answer as token limit reached": 0.003, "format error: generate an answer as token limit reached": 0.033, "exceed available llm calls": 0.003}, "avg_tool_calls_per_question_correctly_solved": 13.594, "avg_assistant_tokens_per_question_correctly_solved": 3183.946}}

likuanppd, Oct 28 '25 04:10
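For readers comparing their own runs against the JSON above, here is a minimal sketch of how per-round accuracy and a pass@k style number could be aggregated from per-question correctness. The function name `aggregate_rounds`, its input layout, and the output keys are hypothetical, and it is only an assumption that the repository's evaluation script defines its `overall` fields in this way.

```python
from statistics import mean


def aggregate_rounds(rounds):
    """Aggregate correctness over several independent rounds.

    rounds[r][q] is True if question q was answered correctly in round r.
    All names and the output layout are illustrative assumptions, not the
    repository's actual evaluation code.
    """
    num_questions = len(rounds[0])
    if any(len(r) != num_questions for r in rounds):
        raise ValueError("every round must cover the same questions")

    # Pass@1 of each round, as a percentage.
    per_round = [100.0 * sum(r) / num_questions for r in rounds]
    # A question counts as solved if any round got it right.
    solved_any = [any(col) for col in zip(*rounds)]

    return {
        "individual": [round(p, 2) for p in per_round],
        "avg_pass_at_1": round(mean(per_round), 2),
        "best_pass_at_1": round(max(per_round), 2),
        "pass_at_k": round(100.0 * sum(solved_any) / num_questions, 2),
    }


if __name__ == "__main__":
    demo = [
        [True, False, True, False],
        [True, True, False, False],
        [False, False, True, False],
    ]
    print(aggregate_rounds(demo))
```

Note that the denominator matters: a score computed over the 103 text-only questions is not directly comparable to one computed over all 165 validation questions.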

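That issue only ran the text-only subset; I ran all 165 validation examples. Also, how much does the choice of summary model affect the final result? Do you have any experience with this? I used GLM4.5 for the summary step.

glennccc, Oct 28 '25 06:10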

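@likuanppd One more question: your main loop does not cap the number of rounds. Could that be a problem? I have hit several cases where the agent keeps repeating the same earlier actions, like an infinite loop. The round count keeps climbing to a very large value, and in the end no answer is ever written to prediction.

glennccc, Oct 28 '25 06:10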
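One way to guard against the looping behaviour described above is to bound the loop and stop early when the same action recurs. The sketch below is illustrative only: `run_agent`, `step_fn`, `max_rounds`, and `max_repeats` are hypothetical names, not part of the DeepResearch code, and the thread does not confirm how the actual main loop terminates.

```python
from collections import Counter


def run_agent(step_fn, max_rounds=50, max_repeats=3):
    """Run an agent loop with a round cap and repeated-action detection.

    step_fn(history) is a hypothetical callback returning (action, answer):
    `action` is a hashable description of the tool call (for example a
    string), and `answer` is None until the agent produces a final answer.
    """
    history = []
    seen = Counter()

    for round_idx in range(1, max_rounds + 1):
        action, answer = step_fn(history)
        if answer is not None:
            return {"prediction": answer, "termination": "answer",
                    "rounds": round_idx}

        seen[action] += 1
        if seen[action] >= max_repeats:
            # Same action issued too often: stop instead of spinning.
            return {"prediction": "", "termination": "repeated action",
                    "rounds": round_idx}
        history.append(action)

    # Always emit a record so the question is not silently dropped.
    return {"prediction": "", "termination": "exceed max rounds",
            "rounds": max_rounds}


if __name__ == "__main__":
    stuck = lambda history: ("search: same query", None)
    print(run_agent(stuck))
```

Returning an explicit record with an empty prediction and a termination reason keeps stuck questions visible in the evaluation rather than leaving them without an entry.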

Were you able to find the 103 text-only examples, or do they have to be filtered manually?

qcname, Nov 26 '25 02:11
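The thread does not say how the 103-question text-only subset was selected. As a first pass before any manual review, one could separate questions with an attached file from those without. The sketch below assumes each example is a dict with a `file_name` field that is empty when there is no attachment; this filter alone is not confirmed to yield exactly the 103 questions used in the referenced issue.

```python
def split_by_attachment(examples):
    """Split examples into (no attachment, has attachment).

    Assumes each example is a dict whose "file_name" value is empty,
    None, or missing when the question has no attached file. This is an
    illustrative first-pass filter, not the official subset definition.
    """
    text_only, with_file = [], []
    for ex in examples:
        name = (ex.get("file_name") or "").strip()
        if name:
            with_file.append(ex)
        else:
            text_only.append(ex)
    return text_only, with_file


if __name__ == "__main__":
    sample = [
        {"task_id": "a", "file_name": ""},
        {"task_id": "b", "file_name": "audio.mp3"},
        {"task_id": "c", "file_name": None},
    ]
    text_only, with_file = split_by_attachment(sample)
    print(len(text_only), len(with_file))
```

Questions without a file can still require non-text content found on the web, so the result would likely need a manual check against the subset used in the referenced issue.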