InfiniteBench
Surprisingly low perf on code.Debug
Thanks for the work on this benchmark.
I was wondering why the baseline accuracies on code.Debug are so low.
code.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% | 13.96% | 7.36%
Since it's multiple choice with four options, random guessing should give about 25% in expectation. Have you released the outputs from your evaluation runs anywhere?
The context is long and noisy, and LLMs tend to "think" rather than guess, so accuracy can fall below the random-guess baseline. The outputs are also posted in our repo.
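To make the arithmetic concrete, here is a minimal sketch of how accuracy on a four-option task can drop below 25% when a model does not always commit to an option. Everything in it is an assumption for illustration: the exact-match `score` function is not the benchmark's actual scorer, and `commit_rate` is a hypothetical parameter, not a measured value.

```python
import random

OPTIONS = ["A", "B", "C", "D"]


def score(prediction, answer):
    """Hypothetical scorer: correct only if the prediction is the gold letter."""
    return prediction == answer


def expected_accuracy(commit_rate, n_options=len(OPTIONS)):
    """Expected accuracy when a model guesses uniformly at random, but only on
    the fraction of questions where it commits to an option at all."""
    return commit_rate / n_options


def simulate(n_questions, commit_rate, seed=0):
    """Monte Carlo check of expected_accuracy under the same assumptions."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        answer = rng.choice(OPTIONS)
        if rng.random() < commit_rate:
            prediction = rng.choice(OPTIONS)  # model picks some option
        else:
            prediction = None  # model "thinks" and never names an option
        correct += score(prediction, answer)
    return correct / n_questions


if __name__ == "__main__":
    for rate in (1.0, 0.5, 0.2):
        print(rate, expected_accuracy(rate), round(simulate(100_000, rate), 3))
```

Under these assumptions a model that always picks an option lands near 25%, while one that commits on only a fifth of the questions lands near 5%. Checking the posted outputs for responses that contain no option letter would show whether this is what happens in practice.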