John Yang

Results: 92 comments by John Yang

Hi @bytesuji, can you try running evaluation again? We have observed that many people face similar challenges when setting up SWE-bench evaluation. We spent...

Hi @t-kurabayashi, this will take a bit of time to run and confirm on our side; we will get back to you on this.

Hi @t-kurabayashi, thanks for your patience. We have recently made a number of improvements to the SWE-bench evaluation harness and wrote a report about it [here](https://github.com/princeton-nlp/SWE-bench/tree/main/docs/20240415_eval_bug). We will also upload...

Closing this issue for now, but feel free to re-open if problems persist!

We have updated the paper recently to reflect the corrections alluded to in this task instance. We will also release the execution logs and predictions for all models run so far...

@zhimin-z @itaowei just a small update - thanks for your patience, we will be finalizing this and posting about it by the end of this month. In short, it will...

Thanks for pointing this out, @JasonGross @skzhang1, along with the proposed solution. I agree that this is a bit inconvenient. The solution by @JasonGross would definitely work. For convenience, in...

@JasonGross I understand what you are saying. I think it is fine if users would like to use the workaround you suggested, but I don't plan to support this auto-incrementing...

Hi @LuoKaiGSW @anmolagarwal999, thanks for the great question. To clarify, this is actually expected behavior. During the *validation* phase of task instances, to determine whether a task instance is...

Hi all, thanks for your patience; we will respond more promptly going forward. We realized that many people have been running into common evaluation harness errors. We have spent the last...