An apparent bug in drop's handling of multi-span answers
Hi there! 🤗
It seems that drop_metrics selects only the first span when an answer is of type multi-span:
https://github.com/huggingface/lighteval/blob/ad42e43bcc3bd50fdba68936999bf553bf53b9e4/src/lighteval/metrics/harness_compatibility/drop.py#L149-L153
Maybe we could remove the first if statement and change the second one accordingly to fix it.
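For illustration, here is a hedged sketch of the kind of change being suggested. The body below is modeled on the harness's `parse_answer` rather than copied from the linked lighteval lines, so treat the exact structure as an assumption:

```python
def parse_answer(answer):
    # Sketch only: assumes the lighteval code mirrors the harness's
    # parse_answer. Everything is returned as a tuple for hashability.
    if answer["number"] != "":
        return (str(answer["number"]),)
    if answer["spans"]:
        # Return every span instead of only answer["spans"][0], so that
        # multi-span gold answers are kept whole.
        return tuple(answer["spans"])
    return (
        " ".join(
            [answer["date"]["day"], answer["date"]["month"], answer["date"]["year"]]
        ).strip(),
    )
```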
I think we had to use this system to be compatible with the results of the Eleuther AI Harness - could you check whether they have updated the mechanism, so we can see if we should update on our side too? :)
It seems that it's correct there:
```python
for gold_answer in golds:
    exact_match, f1_score = get_metrics(preds, gold_answer)
    if gold_answer[0].strip():
        max_em = max(max_em, exact_match)
        max_f1 = max(max_f1, f1_score)
return {"em": max_em, "f1": max_f1}
```
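To make the control flow above concrete: the snippet scores the prediction against every gold candidate and keeps the best EM/F1, with the `gold_answer[0].strip()` guard skipping empty golds. Below is a self-contained toy version of that loop; `get_metrics` here is a hypothetical stand-in for the harness's DROP scoring, written only so the example runs:

```python
# Toy stand-in for the harness's get_metrics, NOT its implementation:
# EM is set equality of normalized spans, F1 is span-level overlap.
def get_metrics(preds, gold_answer):
    pred_set = {p.strip().lower() for p in preds}
    gold_set = {g.strip().lower() for g in gold_answer}
    em = float(pred_set == gold_set)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return em, f1

preds = ("Paris", "Lyon")
golds = [("Paris", "Lyon"), ("Paris",)]  # one multi-span gold, one single-span
max_em, max_f1 = 0.0, 0.0
for gold_answer in golds:
    exact_match, f1_score = get_metrics(preds, gold_answer)
    if gold_answer[0].strip():  # skip empty gold answers
        max_em = max(max_em, exact_match)
        max_f1 = max(max_f1, f1_score)
print({"em": max_em, "f1": max_f1})  # multi-span gold matched in full -> em 1.0
```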
Yes, you are right. I checked the downstream logic and it seems they don't take the first item anywhere - I think it's something we do on our side, assuming the length is always one, but we would have to check this and add an assert. If you want to make the edits and open a PR, feel free to :)
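If the downstream code does rely on single-span answers somewhere, a guard like the following (placement and variable name are hypothetical) would make that assumption fail loudly instead of silently truncating:

```python
# Hypothetical guard for wherever lighteval currently takes item [0]:
# fail if a multi-span gold answer reaches code that assumes exactly
# one span, rather than silently dropping the extra spans.
assert len(gold_answer) == 1, (
    f"Expected a single-span answer, got {len(gold_answer)}: {gold_answer!r}"
)
```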