Results of AgentPoirot with Claude-3.7-Sonnet as Backbone
Hi,
Thank you for your contributions.
I ran some experiments using the code you provided, switching the backbone from GPT-4o to Claude-3.7-Sonnet. I expected performance comparable to GPT-4o, but the current run only achieves 0.25 ROUGE-1 on both the insight-level and summary-level evaluations, which is far behind the reported results (~0.32).
Our parameters are:
`{ "benchmark_type": "full", "branch_depth": 4, "max_questions": 3, "model_name": "claude" }`
These settings should match the experiments reported in the paper. Could you please give some suggestions for our experiments?
Best, Ethan
@wjhou thanks for trying out our benchmark! I don't think ROUGE is a metric we should really be looking at to measure model quality. We report ROUGE scores because it's a standard metric and we have to, but it's well established in the literature that it has serious blind spots because it doesn't measure semantics: for all we know, your insights could be much higher quality, and ROUGE would still penalize your model for not using the exact words in the ground-truth insights. You can compute the LLM score with Claude as the backend (on a subset to start with) and see if that makes more sense.
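To see concretely how ROUGE-1 penalizes paraphrases, here is a minimal pure-Python sketch of unigram-overlap F1 (a simplified approximation, not the exact scorer used in the paper, with no stemming or tokenization beyond whitespace splitting; the example insight strings are made up for illustration):

```python
from collections import Counter

def rouge1_f1(reference: str, prediction: str) -> float:
    """Approximate ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum((ref & pred).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Two insights with essentially the same meaning but different wording:
reference = "average churn rate increased sharply in the third quarter"
prediction = "churn grew rapidly during q3"
print(round(rouge1_f1(reference, prediction), 2))  # prints 0.14
```

Only "churn" overlaps, so a semantically equivalent insight scores near zero; this is why an LLM-based score can be a more meaningful comparison than raw ROUGE-1.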
Hi @Demfier
Thank you for your prompt reply; I greatly appreciate your help. Would you mind sharing the results of AgentPoirot and Pandas Agent for comparison? Having access to these numbers would significantly expedite our research.
Best regards, Ethan