An apparent bug in drop's handling of multi-span answers
Hi there! 🤗
It seems that drop_metrics selects only the first span when an answer is of type multi-span:
https://github.com/huggingface/lighteval/blob/ad42e43bcc3bd50fdba68936999bf553bf53b9e4/src/lighteval/metrics/harness_compatibility/drop.py#L149-L153
Maybe we could remove the first if statement and change the second one accordingly to fix it.
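For illustration, here is a hedged sketch of the kind of change being suggested. The body below is modeled on the harness's `parse_answer` rather than copied from the linked lighteval lines, so treat the exact structure as an assumption:

```python
def parse_answer(answer):
    # Sketch only: assumes the lighteval code mirrors the harness's
    # parse_answer. Everything is returned as a tuple for hashability.
    if answer["number"] != "":
        return (str(answer["number"]),)
    if answer["spans"]:
        # Return every span instead of only answer["spans"][0], so that
        # multi-span gold answers are kept whole.
        return tuple(answer["spans"])
    return (
        " ".join(
            [answer["date"]["day"], answer["date"]["month"], answer["date"]["year"]]
        ).strip(),
    )
```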
I think we had to use this system to be compatible with the results of the Eleuther AI Harness - could you check whether they have updated the mechanism, so we can see if we should update on our side too? :)
It seems that it's correct there:
```python
for gold_answer in golds:
    exact_match, f1_score = get_metrics(preds, gold_answer)
    if gold_answer[0].strip():
        max_em = max(max_em, exact_match)
        max_f1 = max(max_f1, f1_score)
return {"em": max_em, "f1": max_f1}
```
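To make the control flow above concrete: the snippet scores the prediction against every gold candidate and keeps the best EM/F1, with the `gold_answer[0].strip()` guard skipping empty golds. Below is a self-contained toy version of that loop; `get_metrics` here is a hypothetical stand-in for the harness's DROP scoring, written only so the example runs:

```python
# Toy stand-in for the harness's get_metrics, NOT its implementation:
# EM is set equality of normalized spans, F1 is span-level overlap.
def get_metrics(preds, gold_answer):
    pred_set = {p.strip().lower() for p in preds}
    gold_set = {g.strip().lower() for g in gold_answer}
    em = float(pred_set == gold_set)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return em, f1

preds = ("Paris", "Lyon")
golds = [("Paris", "Lyon"), ("Paris",)]  # one multi-span gold, one single-span
max_em, max_f1 = 0.0, 0.0
for gold_answer in golds:
    exact_match, f1_score = get_metrics(preds, gold_answer)
    if gold_answer[0].strip():  # skip empty gold answers
        max_em = max(max_em, exact_match)
        max_f1 = max(max_f1, f1_score)
print({"em": max_em, "f1": max_f1})  # multi-span gold matched in full -> em 1.0
```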
Yes, you are right. I checked the downstream logic and it seems they don't take the first item anywhere - I think it's something we do on our side, assuming the length is always one, but we would have to check this and add an assert. If you want to make the edits and open a PR, feel free to :)
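If the downstream code does rely on single-span answers somewhere, a guard like the following (placement and variable name are hypothetical) would make that assumption fail loudly instead of silently truncating:

```python
# Hypothetical guard for wherever lighteval currently takes item [0]:
# fail if a multi-span gold answer reaches code that assumes exactly
# one span, rather than silently dropping the extra spans.
assert len(gold_answer) == 1, (
    f"Expected a single-span answer, got {len(gold_answer)}: {gold_answer!r}"
)
```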