John Yang

Results: 92 comments by John Yang

Hi @bytesuji, can you try running evaluation again? We have observed that many people face similar challenges when setting up SWE-bench evaluation. We spent...

Hi @t-kurabayashi, this will take a bit of time to run and confirm on our side; we will get back to you on this.

Hi @t-kurabayashi, thanks for your patience. We have recently made a number of improvements to the SWE-bench evaluation harness and wrote a report about it [here](https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240415_eval_bug). We will also upload...

Closing this issue for now, but feel free to re-open if problems persist!

We have updated the paper recently to reflect the corrections alluded to in this task instance. We will also release the execution logs and predictions for all models run so far...

@zhimin-z @itaowei just a small update - thanks for your patience, we will be finalizing this and posting about it by the end of this month. In short, it will...

Thanks for pointing this out, @JasonGross @skzhang1, along with the proposed solution. I agree that this is a bit inconvenient. The solution by @JasonGross would definitely work. For convenience, in...

@JasonGross I understand what you are saying. I think it is fine if users would like to use the workaround you suggested, but I don't plan to support this auto-incrementing...

Hi @LuoKaiGSW @anmolagarwal999, thanks for the great question. To clarify, this is actually expected behavior. During the *validation* phase of task instances, to determine whether a task instance is...

Hi all, thanks for your patience; we will respond more promptly going forward. We realized that many people have been running into common evaluation harness errors. We have spent the last...