InfiniteBench
bug in computing scores for longdialogue_qa_eng
https://github.com/OpenBMB/InfiniteBench/blob/main/src/compute_scores.py#L238
- only one reference label is used for comparison; it would be better to loop over every answer in the label list, e.g., label=['ECKER', 'COMMANDER BILL ECKER'];
- the predicted phrase is split into words before comparison, so a multi-word label can never match while a single-word label can:
pred=STAMP PAID, label=['STAMP PAID'], score=0.0, data_name=longdialogue_qa_eng
and
pred=ADRIAN, label=['ADRIAN'], score=1.0, data_name=longdialogue_qa_eng
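A minimal sketch of the two behaviours, for illustration only. Neither function is copied from `compute_scores.py`: `score_split_words` is a hypothetical reconstruction of the reported behaviour (it assumes the single label used is the first one), and `score_any_label` is one possible fix that loops over every label and compares whole phrases.

```python
from typing import List


def _normalize(text: str) -> str:
    """Upper-case and collapse whitespace so phrases compare consistently."""
    return " ".join(text.upper().split())


def score_split_words(pred: str, labels: List[str]) -> float:
    """Hypothetical reconstruction of the reported bug.

    Assumption: only labels[0] is checked, and it is compared against
    the individual words of the prediction.
    """
    words = pred.upper().split()
    return float(labels[0].upper() in words)


def score_any_label(pred: str, labels: List[str]) -> float:
    """Sketch of a fix: accept the prediction if any label occurs in it as a phrase."""
    normalized_pred = _normalize(pred)
    for label in labels:
        normalized_label = _normalize(label)
        # Skip empty labels, which would otherwise match every prediction.
        if normalized_label and normalized_label in normalized_pred:
            return 1.0
    return 0.0


if __name__ == "__main__":
    print(score_split_words("STAMP PAID", ["STAMP PAID"]))
    print(score_split_words("ADRIAN", ["ADRIAN"]))
    print(score_any_label("STAMP PAID", ["STAMP PAID"]))
    print(score_any_label("COMMANDER BILL ECKER", ["ECKER", "COMMANDER BILL ECKER"]))
```

Note that substring matching is lenient: a label such as `ADRIAN` would also match a prediction containing `ADRIANA`. A stricter variant could require word boundaries around the label.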
We have fixed this bug in the latest commit. Thank you for reporting it.