evoeval Potential test case bugs in the difficult subset

Hi -- very nice eval.

I'm looking at the difficult subset and it seems like there are a number of problems that are incorrectly specified or have bugs in the reference solutions. Here are a few examples I've run into when checking the first ~30 tests.

evoeval-1: The task says inputs will be split by spaces into groups, but many tests are of the form "(()))()(()" with the expected output being "(())", "()". This is wrong according to the spec: the input contains no spaces, so it should be treated as one group and dropped.

evoeval-14: In the example case all_prefix_suffix_pairs('abcadg', 2) it's not specified why ('abc', 'adg') shouldn't be valid. This also has length>=2 and is non-overlapping.

evoeval-22: Several test cases are wrong; on the input [True, False, None, 0, 1, 2] the output given is [false, 0, true, 1, 2] but false and true are not integers.

evoeval-23: The task says that whitespace should be removed, but the test case '\t\n', True, False says Expected 2 when '\t\n' is whitespace.

evoeval-3: The task doesn't clearly state that there is one transaction per day. A valid interpretation is that all transactions are done in one day, so either exceeding the daily limit (in sum total) or going negative is invalid.

evoeval-32: The polynomial [10, -15, 56, -40] does cross zero on the input range [-10, 10]. However, the test case asserts the answer is false, because the two endpoints have the same sign. The task technically says this is the correct output, but I'd argue the task description should be changed.

I'm curious what process was used to generate and filter these test cases? What do you think is the highest achievable accuracy on this dataset? (From my quick scanning it looks like maybe 70-80% would correspond to a saturated dataset.)

May 25 '25 20:05 carlini

I believe the reference solution for problem 73 is actually incorrect. The question asks:

"Given an array arr of integers, find the minimum number of elements that\n need to be changed to make the array palindromic and the sum of its elements divisible by a given integer k.\n A palindromic array is an array that is read the same backwards and forwards.\n In one change, you can change one element to any other integer."

It then provides a test case for [1,2,3,5,4,7,9,6], 14 on which the reference solution finds a minimum of 5 changes.

This is incorrect if I'm reading the problem right: you can do it in 4 with [1,9,7,4,4,7,9,1].
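The claim above can be checked mechanically. The following is a minimal sketch, not the dataset's reference solution; the helper names (`is_palindromic`, `count_changes`, `mismatched_pairs`) are hypothetical and introduced only for illustration. It confirms that the candidate array is palindromic, that its sum is divisible by 14, and that it differs from the input in 4 positions. It also counts mismatched mirror pairs, since each such pair needs at least one change.

```python
def is_palindromic(arr):
    """Return True if the list reads the same forwards and backwards."""
    return arr == arr[::-1]


def count_changes(original, candidate):
    """Count positions where candidate differs from original."""
    if len(original) != len(candidate):
        raise ValueError("arrays must have the same length")
    return sum(a != b for a, b in zip(original, candidate))


def mismatched_pairs(arr):
    """Count mirror pairs (i, n-1-i) whose values differ.

    Each mismatched pair needs at least one change, so this is a
    lower bound on the number of changes for any palindromic result.
    """
    n = len(arr)
    return sum(arr[i] != arr[n - 1 - i] for i in range(n // 2))


original = [1, 2, 3, 5, 4, 7, 9, 6]
candidate = [1, 9, 7, 4, 4, 7, 9, 1]
k = 14

print(is_palindromic(candidate))            # True
print(sum(candidate) % k == 0)              # True (the sum is 42)
print(count_changes(original, candidate))   # 4
print(mismatched_pairs(original))           # 4
```

Since the lower bound and the number of changes used by the candidate are both 4, no answer larger than 4 can be minimal for this input.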

May 26 '25 20:05 carlini

Hi @carlini

Thanks for checking out the dataset:

evoeval-1

It is mentioned later in the problem description: "Ignore any spaces and any non-parentheses characters in the input string." However, I agree that the earlier phrase "multiple groups of nested parentheses separated by spaces" can be misleading (and also pretty confusing).

evoeval-14

I'm a bit confused: in the example, 'abc' and 'adg' overlap in 'a', so it should be invalid.

evoeval-22

For this problem I would point to the original HumanEval problems, which use a similar method (i.e., isinstance(value, int)) to check whether a value is an integer. I would also agree with you that it should not count.
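For context on why isinstance(value, int) lets booleans through: in Python, bool is a subclass of int. Below is a small sketch on the input from the report; `is_strict_int` is a hypothetical helper, not something taken from the dataset or from HumanEval.

```python
def is_strict_int(value):
    """Return True for ints but not for bools (bool subclasses int)."""
    return isinstance(value, int) and not isinstance(value, bool)


values = [True, False, None, 0, 1, 2]

# The plain isinstance check keeps True and False.
loose = [v for v in values if isinstance(v, int)]

# Excluding bool explicitly keeps only the real integers.
strict = [v for v in values if is_strict_int(v)]

print(issubclass(bool, int))          # True
print([type(v).__name__ for v in loose])   # ['bool', 'bool', 'int', 'int', 'int']
print(strict)                         # [0, 1, 2]
```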

evoeval-23

Agreed that we should count all whitespace here
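As a quick illustration of the whitespace point, here is a minimal sketch that assumes the task counts the characters left after removing whitespace; the exact task semantics and the helper name `strip_all_whitespace` are assumptions. It shows that Python treats '\t\n' as whitespace, and that removing only the space character would leave both characters in place, which is one possible (unconfirmed) way to arrive at 2.

```python
def strip_all_whitespace(text: str) -> str:
    """Remove every character that str.isspace() treats as whitespace."""
    return "".join(ch for ch in text if not ch.isspace())


sample = "\t\n"

print(sample.isspace())                     # True
print(len(strip_all_whitespace(sample)))    # 0
# Removing only ' ' leaves the tab and newline untouched.
print(len(sample.replace(" ", "")))         # 2
```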

evoeval-3

That is definitely a fair interpretation; I agree it should be made clearer.

evoeval-32

This seems to be a problem with our ground-truth (gt) method of solving this problem.

evoeval-73

You are right, this is a problem with our gt solution as well.

"I'm curious what process was used to generate and filter these test cases?"

We first generate some example test cases with an LLM based on the question (i.e., the ones you see in the docstring). We then try to manually verify the generated solution and run these example test cases to get the expected output (although these can sometimes still be wrong, as you pointed out). Finally, we augment these test cases with new ones, again using an LLM, by simply prompting it for more unique/difficult test cases.
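The step of running inputs through the solution to get expected outputs can be sketched as below. This is only an illustration under assumptions: `build_expected_outputs` and `buggy_reference` are hypothetical names, the LLM generation steps are omitted, and none of this is the actual EvoEval tooling. It shows why a bug in the solution carries straight into the expected outputs.

```python
from typing import Any, Callable, Iterable


def build_expected_outputs(
    reference_solution: Callable[..., Any],
    inputs: Iterable[tuple],
) -> list[tuple[tuple, Any]]:
    """Pair each input tuple with whatever the reference solution returns."""
    cases = []
    for args in inputs:
        cases.append((args, reference_solution(*args)))
    return cases


def buggy_reference(values):
    """Hypothetical reference solution that keeps bools by accident."""
    return [v for v in values if isinstance(v, int)]


# The expected output records the bug: True is kept alongside 2.
cases = build_expected_outputs(buggy_reference, [([True, 2, "x"],)])
print(cases)
```

Because the expected value is whatever the solution produced, extra test inputs alone cannot catch this kind of error; it needs an independent check of the solution.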

"What do you think is the highest achievable accuracy on this dataset? (From my quick scanning it looks like maybe 70-80% would correspond to a saturated dataset.)"

I think ultimately all the problems should be solvable (apart from the ones with slightly ambiguous text), so I would guess the highest accuracy should be upwards of 80%, if not more. But again, we have not tested it on any of the newer models.

Jun 06 '25 02:06 brutalsavage