`LightevalTask.process_results()` is not aligned with `LightevalTask.get_request_type()`
Hi there!
`LightevalTask.process_results()` does not expect the same ordering of responses by request type as the one implied by `LightevalTask.get_request_type()`:
`create_requests_from_tasks` returns a dict of requests whose keys follow the order returned by `get_request_type()`. This dict is iterated in `evaluate()`, so the responses of the different request types belonging to a given task example are collected into a list in that same order. But `process_results()` expects a different order, which can break things.
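To make the mismatch concrete before the full repro, here is a toy, self-contained sketch (not the actual lighteval internals; the request-type names and values are made up):

```python
# get_request_type()-style deduplication via set(): the resulting order is
# arbitrary and can differ between runs (string hashing is randomized here).
request_types = ["LOGLIKELIHOOD", "GREEDY_UNTIL", "LOGLIKELIHOOD"]
deduped = list(set(request_types))

# evaluate()-style grouping: the responses for one example are collected in
# that (unstable) order.
responses = {"LOGLIKELIHOOD": -6.87, "GREEDY_UNTIL": "A"}
grouped = [responses[t] for t in deduped]

# process_results()-style consumption is positional: it assumes, say, that
# grouped[0] is the loglikelihood and grouped[1] the generated string. If the
# set() order flips, the exact-match metric receives the float and fails with
# the AttributeError shown below.
maybe_loglikelihood, maybe_generation = grouped
print(maybe_loglikelihood, maybe_generation)
```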
A reproducing example:
```python
import collections

from lighteval.evaluator import evaluate
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.base_model import BaseModel, BaseModelConfig, EnvConfig
from lighteval.tasks.lighteval_task import LightevalTask, LightevalTaskConfig
from lighteval.tasks.requests import Doc, TaskExampleId


def test_get_request_type_and_process_results_alignment():
    config = LightevalTaskConfig(
        "test", "arc", "test", "test", ["loglikelihood_acc", "exact_match"],
        evaluation_splits="test", generation_size=4, stop_sequence=["\n"],
    )
    task = LightevalTask("test", config)
    doc = Doc("Which one?", ["A"], 0)

    # Group the requests per type, following the order returned by get_request_type()
    reqs = task.construct_requests(doc, doc.query, "", "test|0")
    all_reqs = collections.defaultdict(list)
    req_types = task.get_request_type()
    for req_type in req_types:
        all_reqs[req_type].extend(reqs[req_type])

    model_config = BaseModelConfig("hf-internal-testing/tiny-random-gpt2")
    model = BaseModel(model_config, EnvConfig())
    evaluation_tracker = EvaluationTracker()

    # process_results() (called from evaluate()) assumes a different response ordering
    evaluate(model, all_reqs, {TaskExampleId("test|0", ""): doc}, {"test": task}, 1, evaluation_tracker)
```
Output:
```
self = <lighteval.metrics.metrics_sample.ExactMatches object at 0x7f744b3f8ca0>, gold = 'A'
pred = -6.878859996795654

    def compute_one_item(
        self,
        gold: str,
        pred: str,
    ) -> float:
        """Compares two strings only.

        Args:
            gold (str): One of the possible references
            pred (str): One of the possible predictions

        Returns:
            float: The exact match score. Will be 1 for a match, 0 otherwise.
        """
        if not pred:
            return 0

        if self.strip_strings:
            gold = gold.strip()
>           pred = pred.strip()
E           AttributeError: 'float' object has no attribute 'strip'
```
In the traceback above, the exact-match metric receives a loglikelihood response (a float) instead of the generated string, because the responses arrive in an unexpected order. A possible fix is to make `get_request_type()` always return the same ordering (currently it calls `set()` on its intermediate list of request types, and sets do not guarantee ordering) and to change `process_results()` to expect the responses in that same ordering.
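As a sketch of that first part (a hypothetical helper, not the actual lighteval code), the deduplication could preserve first-seen order instead of going through `set()`:

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_preserving_order(items: Iterable[T]) -> List[T]:
    """Drop duplicates while keeping first-seen order (unlike list(set(...)))."""
    seen = set()
    ordered: List[T] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered


# Equivalent one-liner, since dicts preserve insertion order in Python 3.7+:
# list(dict.fromkeys(items))
```

`process_results()` could then rely on that deterministic ordering when it unpacks the per-example responses.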
By the way, thank you for this great library. My issues & PRs come from my work on evaluating LLMs in a specific domain. I would be happy to tackle this issue if that helps.
We're a bit swamped, but thanks a lot for your interest in the lib! We'll come back to it at full speed, hopefully in a week, and will do our best to address your PRs.
@clefourrier, shall I send new PRs in the meantime, or should I wait until then if that would be inconvenient?
It's not inconvenient at all, we just won't have time to work on them and give you feedback for at least a week :)
A cleaner example with the same error:
```python
from unittest.mock import Mock

from lighteval.evaluator import evaluate
from lighteval.models.base_model import BaseModel, BaseModelConfig
from lighteval.models.model_config import EnvConfig
from lighteval.tasks.lighteval_task import LightevalTask, create_requests_from_tasks
from lighteval.tasks.registry import Registry, taskinfo_selector

model_config = BaseModelConfig("hf-internal-testing/tiny-random-LlamaForCausalLM")
base_model = BaseModel(model_config, EnvConfig())

task_names_list, few_shots_dict = taskinfo_selector("original|arc:c:letters|0|0")
task_dict = Registry(cache_dir="").get_task_dict(task_names_list)
LightevalTask.load_datasets(task_dict.values())

requests, docs = create_requests_from_tasks(
    task_dict=task_dict,
    fewshot_dict=few_shots_dict,
    num_fewshot_seeds=0,
    lm=base_model,
    max_samples=1,
    evaluation_tracker=Mock(),
    use_chat_template=False,
    system_prompt="",
)

evaluate(base_model, requests, docs, task_dict, 1, Mock())
```