Surprisingly low perf on code.Debug

Open seyuboglu opened this issue 1 year ago • 1 comments

Thanks for the work on this benchmark.

I was wondering why the baseline accuracies on code.Debug are so low.

code.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% | 13.96% | 7.36%

Since it's multiple choice with four options, random guessing should give about 25% in expectation, yet most of these scores are well below that. Have you released the outputs from your evaluation runs anywhere?

Jan 09 '25 02:01 seyuboglu

The context is long and noisy, and LLMs tend to "think" rather than guess, so they often don't commit to one of the options. The outputs are also posted in our repo.
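To make the arithmetic concrete, here is a minimal sketch of how accuracy can land below the 25% chance level. It assumes that a response with no parseable option is scored as incorrect, which is an assumption about the scoring and not something confirmed in this thread. The function name and the example rates are hypothetical.

```python
def expected_accuracy(p_answer: float, acc_when_answering: float) -> float:
    """Expected accuracy when responses without a parseable option count as wrong.

    p_answer: fraction of responses that commit to one of the options.
    acc_when_answering: accuracy among those committed responses.
    """
    if not (0.0 <= p_answer <= 1.0 and 0.0 <= acc_when_answering <= 1.0):
        raise ValueError("both arguments must be in [0, 1]")
    return p_answer * acc_when_answering


chance = 1 / 4  # four options, uniform random guessing

# A model that always picks an option at random sits at chance.
always_guess = expected_accuracy(1.0, chance)

# A model that commits to an option only 20% of the time (hypothetical rate)
# falls well below chance even if its committed answers are no worse than random.
rarely_answers = expected_accuracy(0.2, chance)

print(f"always guesses: {always_guess:.2%}")
print(f"rarely answers: {rarely_answers:.2%}")
```

Under that assumption, a sub-25% score says more about how often the model commits to an option than about how well it debugs.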

May 14 '25 03:05 tuantuanzhang