MMLU answer extraction regex fails with repeated "Answer: LETTER" pattern
Description
The regular expression used to extract MMLU answers in common.py fails when the pattern "Answer: LETTER" appears more than once in the LLM output. Correct answers are then scored as wrong, which lowers the measured model performance.
Example
The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.
Explanation
The regular expression mistakenly only considers the first occurrence of "Answer: LETTER".
https://github.com/openai/simple-evals/blob/a8e85cc8a5dea497d915f870895250e07f9cc737/common.py#L25-L71
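In the German example above, it extracts the answer "A" from "Antwort:\n\nAntwort: C": after matching the first "Antwort:", the pattern skips the whitespace (including the newlines) and captures the "A" that begins the second "Antwort" as the answer letter.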
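The behaviour can be reproduced with a minimal sketch. The pattern below is a simplified stand-in (an assumption for illustration), not the full multilingual pattern from common.py, which covers more answer prefixes and letter variants:

```python
import re

# Simplified stand-in for the pattern in common.py (assumption: only the
# German prefix and the letters A-D are modelled here).
OLD_PATTERN = r"(?i)Antwort\s*:\s*([A-D])"

output = "Antwort:\n\nAntwort: C"

# re.search returns the first match. "\s*" consumes the two newlines, and
# with the case-insensitive flag "[A-D]" accepts the "A" of the second "Antwort".
match = re.search(OLD_PATTERN, output)
extracted = match.group(1).upper()
print(extracted)  # A (the model's actual answer was C)
```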
Impact
This bug significantly impacts the evaluation results for certain languages. In my experiments, German experienced this issue with ~20% of the samples, and Indonesian showed a ~4% impact. Other languages seem less affected.
A recent commit (https://github.com/openai/simple-evals/commit/18eba9d23d3a2fb39b5e8cf31036ea07148eae84) partially addressed this issue by no longer allowing newlines between "Answer:" and the letter. As a result, "Antwort:\n\nAntwort: C" no longer yields "A". However, "Antwort: Antwort: C" still does.
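The remaining case can be sketched as follows. Both patterns are simplified stand-ins: `NEWLINE_SAFE` assumes the commit restricts the whitespace before the letter to spaces and tabs, and `WORD_BOUNDARY` is a hypothetical mitigation shown only for comparison, not necessarily what the pull request implements:

```python
import re

# Assumption: the commit allows only spaces and tabs before the letter.
NEWLINE_SAFE = r"(?i)Antwort\s*:[ \t]*([A-D])"
# Hypothetical variant: the letter must not be followed by another word character.
WORD_BOUNDARY = r"(?i)Antwort\s*:[ \t]*([A-D])\b"

def extract(pattern, text):
    """Return the first captured answer letter in upper case, or None."""
    match = re.search(pattern, text)
    return match.group(1).upper() if match else None

newline_case = extract(NEWLINE_SAFE, "Antwort:\n\nAntwort: C")   # newline blocks the bad match
same_line_case = extract(NEWLINE_SAFE, "Antwort: Antwort: C")    # still captures the "A" of "Antwort"
boundary_case = extract(WORD_BOUNDARY, "Antwort: Antwort: C")    # "A" followed by "n" is rejected
print(newline_case, same_line_case, boundary_case)  # C A C
```

Simply taking the last match instead of the first would not help in the same-line case, because the first match already consumes the "A" of the second "Antwort" and no later match is found.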
I think the issue and the pull request are still valid, although I have not assessed the impact of the remaining case. I updated the pull request so it can be merged without conflicts.