`LightevalTask.process_results()` is not aligned with `LightevalTask.get_request_type()`
Hi there!
`LightevalTask.process_results()` does not expect the same ordering of responses by request type as the one implied by `LightevalTask.get_request_type()`:
`create_requests_from_tasks` returns a dict of requests whose keys follow the order returned by `get_request_type()`. This dict is iterated in `evaluate()`, so the responses of the different request types belonging to a given task example are collected into a list in that same order. But `process_results()` expects a different order, which can break things.
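To make the mismatch concrete before the full repro, here is a toy, self-contained sketch (not the actual lighteval internals; the request-type names and values are made up):

```python
# get_request_type()-style deduplication via set(): the resulting order is
# arbitrary and can differ between runs (string hashing is randomized here).
request_types = ["LOGLIKELIHOOD", "GREEDY_UNTIL", "LOGLIKELIHOOD"]
deduped = list(set(request_types))

# evaluate()-style grouping: the responses for one example are collected in
# that (unstable) order.
responses = {"LOGLIKELIHOOD": -6.87, "GREEDY_UNTIL": "A"}
grouped = [responses[t] for t in deduped]

# process_results()-style consumption is positional: it assumes, say, that
# grouped[0] is the loglikelihood and grouped[1] the generated string. If the
# set() order flips, the exact-match metric receives the float and fails with
# the AttributeError shown below.
maybe_loglikelihood, maybe_generation = grouped
print(maybe_loglikelihood, maybe_generation)
```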
A reproducing example:
```python
import collections

from lighteval.evaluator import evaluate
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.base_model import BaseModel, BaseModelConfig, EnvConfig
from lighteval.tasks.lighteval_task import LightevalTask, LightevalTaskConfig
from lighteval.tasks.requests import Doc, TaskExampleId


def test_get_request_type_and_process_results_alignment():
    config = LightevalTaskConfig(
        "test", "arc", "test", "test", ["loglikelihood_acc", "exact_match"],
        evaluation_splits="test", generation_size=4, stop_sequence=["\n"],
    )
    task = LightevalTask("test", config)
    doc = Doc("Which one?", ["A"], 0)

    # Group the requests per type, following the order returned by get_request_type()
    reqs = task.construct_requests(doc, doc.query, "", "test|0")
    all_reqs = collections.defaultdict(list)
    req_types = task.get_request_type()
    for req_type in req_types:
        all_reqs[req_type].extend(reqs[req_type])

    model_config = BaseModelConfig("hf-internal-testing/tiny-random-gpt2")
    model = BaseModel(model_config, EnvConfig())
    evaluation_tracker = EvaluationTracker()

    # process_results() (called from evaluate()) assumes a different response ordering
    evaluate(model, all_reqs, {TaskExampleId("test|0", ""): doc}, {"test": task}, 1, evaluation_tracker)
```
Output:
```
self = <lighteval.metrics.metrics_sample.ExactMatches object at 0x7f744b3f8ca0>, gold = 'A'
pred = -6.878859996795654

    def compute_one_item(
        self,
        gold: str,
        pred: str,
    ) -> float:
        """Compares two strings only.

        Args:
            gold (str): One of the possible references
            pred (str): One of the possible predictions

        Returns:
            float: The exact match score. Will be 1 for a match, 0 otherwise.
        """
        if not pred:
            return 0

        if self.strip_strings:
            gold = gold.strip()
>           pred = pred.strip()
E           AttributeError: 'float' object has no attribute 'strip'
```
In the traceback above, the exact-match metric receives a loglikelihood response (a float) instead of the generated string, because the responses arrive in an unexpected order. A possible fix is to make `get_request_type()` always return the same ordering (currently it calls `set()` on its intermediate list of request types, and sets do not guarantee ordering) and to change `process_results()` to expect the responses in that same ordering.
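As a sketch of that first part (a hypothetical helper, not the actual lighteval code), the deduplication could preserve first-seen order instead of going through `set()`:

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_preserving_order(items: Iterable[T]) -> List[T]:
    """Drop duplicates while keeping first-seen order (unlike list(set(...)))."""
    seen = set()
    ordered: List[T] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered


# Equivalent one-liner, since dicts preserve insertion order in Python 3.7+:
# list(dict.fromkeys(items))
```

`process_results()` could then rely on that deterministic ordering when it unpacks the per-example responses.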
By the way, thank you for this great library. My issues & PRs come from my work on evaluating LLMs in a specific domain. I would be happy to tackle this issue if that helps.
We're a bit swamped, but thanks a lot for your interest in the lib! We'll come back to it at full speed, hopefully in a week, and will do our best to address your PRs.
@clefourrier, shall I send new PRs in the meantime, or should I wait until then if that would be inconvenient?
It's not inconvenient at all, we just won't have time to work on them and give you feedback for at least a week :)
A cleaner example with the same error:
```python
from unittest.mock import Mock

from lighteval.evaluator import evaluate
from lighteval.models.base_model import BaseModel, BaseModelConfig
from lighteval.models.model_config import EnvConfig
from lighteval.tasks.lighteval_task import LightevalTask, create_requests_from_tasks
from lighteval.tasks.registry import Registry, taskinfo_selector

model_config = BaseModelConfig("hf-internal-testing/tiny-random-LlamaForCausalLM")
base_model = BaseModel(model_config, EnvConfig())

task_names_list, few_shots_dict = taskinfo_selector("original|arc:c:letters|0|0")
task_dict = Registry(cache_dir="").get_task_dict(task_names_list)
LightevalTask.load_datasets(task_dict.values())

requests, docs = create_requests_from_tasks(
    task_dict=task_dict,
    fewshot_dict=few_shots_dict,
    num_fewshot_seeds=0,
    lm=base_model,
    max_samples=1,
    evaluation_tracker=Mock(),
    use_chat_template=False,
    system_prompt="",
)

evaluate(base_model, requests, docs, task_dict, 1, Mock())
```