Are there any standard instructions for the prompts?
Is there a standard for testing? Why do some test cases expect the function to return None on invalid input, while others expect it to raise an exception? And why do some tests check that a file exists, while in others actually writing the file causes the run to fail?
Is there a standard for testing?
Please take a look at the 🚀 Remote Evaluation. You can learn more here.
Why do some test cases expect the function to return None on invalid input, while others expect it to raise an exception? And why do some tests check that a file exists, while in others actually writing the file causes the run to fail?
They are designed to align with the docstrings / descriptions of the tasks.
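To illustrate that alignment, here is a minimal sketch (the function and its docstring are hypothetical, not an actual BigCodeBench task): when the docstring promises None for one input and a ValueError for another, the test suite checks exactly those behaviors.

```python
import unittest

def task_func(values):
    """Return the mean of `values`.

    Returns:
        None if `values` is empty, otherwise the mean as a float.
    Raises:
        ValueError: if any element is negative.
    """
    if not values:
        return None
    if any(v < 0 for v in values):
        raise ValueError("negative value")
    return sum(values) / len(values)

class TestTaskFunc(unittest.TestCase):
    def test_empty_returns_none(self):
        # The docstring says "Returns None if empty", so the test
        # checks for None rather than expecting an exception.
        self.assertIsNone(task_func([]))

    def test_negative_raises(self):
        # The docstring says "Raises ValueError", so the test expects
        # an exception instead of a sentinel return value.
        with self.assertRaises(ValueError):
            task_func([1, -2])

if __name__ == "__main__":
    unittest.main()
```

In other words, whether a test expects None or a raised exception is determined by what the task's docstring documents, not by a single global convention.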
I looked closely at BigCodeBench/12 and wondered why only subprocess.call() is accepted, while an LLM would prefer subprocess.run(), which causes an error. This point is not mentioned at all in the complete prompt, so the model is misjudged. How is this resolved? The same is true for many test cases, including several that raise ValueError even though this is never mentioned in the prompt.
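The mismatch described above can be sketched as follows (a simplified illustration in the style of BigCodeBench/12, not the actual task code): a test that patches subprocess.call only intercepts calls made through that exact function, so a solution using the equivalent subprocess.run is judged incorrect.

```python
import subprocess
from unittest import mock

# Hypothetical reference solution: the canonical code uses
# subprocess.call(), but the prompt never states this.
def run_script_reference(path):
    return subprocess.call(["bash", path])

# A model-generated variant that prefers the newer subprocess.run() API.
def run_script_generated(path):
    return subprocess.run(["bash", path]).returncode

# The reference solution goes through subprocess.call, so the patch
# intercepts it and the test sees the expected interaction.
with mock.patch("subprocess.call", return_value=0) as mock_call:
    run_script_reference("backup.sh")
    print(mock_call.called)  # True

# The generated solution never touches subprocess.call, so the same
# patch sees nothing. (subprocess.run is patched too, only to avoid
# spawning a real process in this sketch.)
with mock.patch("subprocess.call", return_value=0) as mock_call, \
     mock.patch("subprocess.run") as mock_run:
    run_script_generated("backup.sh")
    print(mock_call.called)  # False
```

Because the assertion is on the mocked subprocess.call, a behaviorally equivalent subprocess.run solution fails the test even though nothing in the prompt rules it out.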
terryyz left a comment (bigcode-project/bigcodebench#96)
How is this resolved? This is true for many test cases, including several that raise ValueError even though this is never mentioned in the prompt.
We've noticed this for a while. There will be an upcoming version mitigating this issue :-)