MMLU answer extraction regex fails with repeated "Answer: LETTER" pattern

Open lucasresck opened this issue 1 year ago • 1 comments

Description

The regular expression used to extract MMLU answers in common.py fails when the pattern "Answer: LETTER" appears more than once in the LLM output, which distorts the measured model performance.

Example

The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.

[Screenshot of the German model output; image not available]

Explanation

The regular expression only considers the first occurrence of "Answer: LETTER", even when a later occurrence carries the actual answer.

https://github.com/openai/simple-evals/blob/a8e85cc8a5dea497d915f870895250e07f9cc737/common.py#L25-L71

In the German example above, it extracts the answer "A" from "Antwort:\n\nAntwort: C" because the match starts at the first "Antwort:", skips the newlines, and captures the "A" at the start of the second "Antwort".
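
The mismatch can be reproduced with a minimal Python sketch. The pattern below is a simplified, German-only stand-in for the one in common.py (an assumption; the actual pattern covers more languages and letter sets), and `extract_first` is a hypothetical helper name.

```python
import re

# Simplified stand-in for the German answer pattern (assumption: the real
# pattern in common.py is a multilingual template, not this exact string).
PATTERN = r"(?i)Antwort\s*:\s*([A-D])"


def extract_first(text):
    """Return the letter captured by the first match, or None."""
    match = re.search(PATTERN, text)
    return match.group(1) if match else None


output = "Antwort:\n\nAntwort: C"
extracted = extract_first(output)
# The second \s* consumes the two newlines, so the capture group lands on
# the "A" of the second "Antwort" instead of on "C".
print(repr(extracted))
```

Under these assumptions the printed value is 'A', although the model answered "C".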

Impact

This bug significantly impacts the evaluation results for certain languages. In my experiments, German experienced this issue with ~20% of the samples, and Indonesian showed a ~4% impact. Other languages seem less affected.

Dec 10 '24 18:12 lucasresck

A recent commit (https://github.com/openai/simple-evals/commit/18eba9d23d3a2fb39b5e8cf31036ea07148eae84) partially addressed this issue by no longer allowing newlines between "Answer:" and the letter. So "A" is no longer extracted from "Antwort:\n\nAntwort: C". However, it is still extracted from "Antwort: Antwort: C".
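
A minimal sketch of the remaining case, assuming the updated pattern only allows spaces and tabs (written here as `[ \t]*`) where it previously allowed any whitespace. This is a German-only simplification, not the exact pattern from the commit, and `extract_first` is a hypothetical helper name.

```python
import re

# Assumption: simplified version of the pattern after the commit, with
# newlines no longer allowed between "Antwort", the colon, and the letter.
PATTERN_NO_NEWLINES = r"(?i)Antwort[ \t]*:[ \t]*([A-D])"


def extract_first(text):
    """Return the letter captured by the first match, or None."""
    match = re.search(PATTERN_NO_NEWLINES, text)
    return match.group(1) if match else None


# Newlines now block the first "Antwort:", so the second one matches.
with_newlines = extract_first("Antwort:\n\nAntwort: C")

# On a single line, the first "Antwort:" still captures the "A" that
# begins the second "Antwort".
same_line = extract_first("Antwort: Antwort: C")

print(repr(with_newlines), repr(same_line))
```

Under these assumptions the first call yields "C" and the second still yields "A".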

I think the issue and the pull request are still valid, although I have not assessed the impact of the remaining case. I updated the pull request so it can be merged without conflicts.
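
For illustration only, one possible mitigation is to require a word boundary after the captured letter, so that the first letter of a repeated "Antwort" cannot be taken as the answer. This is a sketch under my own assumptions, not necessarily what the pull request does; the pattern is German-only and `extract_answer` is a hypothetical name.

```python
import re

# Hypothetical mitigation: \b after the letter rejects a match when the
# letter is immediately followed by another word character.
PATTERN_WITH_BOUNDARY = r"(?i)Antwort[ \t]*:[ \t]*([A-D])\b"


def extract_answer(text):
    """Return the first standalone answer letter, or None."""
    match = re.search(PATTERN_WITH_BOUNDARY, text)
    return match.group(1) if match else None


repeated = extract_answer("Antwort: Antwort: C")
single = extract_answer("Antwort: C")
word_only = extract_answer("Antwort: Apfel")

print(repr(repeated), repr(single), repr(word_only))
```

Note that taking the last match of the unmodified pattern would not help for "Antwort: Antwort: C": the first match consumes the "A" of the second "Antwort", so there is no second match to fall back on.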

Feb 07 '25 20:02 lucasresck