
Report separate QA metrics in eval for "no answer" labels

Open tholor opened this issue 3 years ago • 0 comments

**Is your feature request related to a problem? Please describe.**
`pipeline.eval()` currently has no mechanism for users to distinguish the performance of Reader models on "no answer" labels from their performance on regular labels. This makes it hard to compare generative and extractive models, and it is also quite useful in purely extractive cases for understanding a model's weak spots. For example, if the model performs badly only on "no answer" labels, you would typically calibrate the `no_ans_boost` or `confidence_threshold` in the Reader. It is also quite common to report these metrics separately (e.g. the official SQuAD eval script, the results in our own model cards, or our old eval in FARM).

**Describe the solution you'd like**
Adjust `EvaluationResult.calculate_metrics()` to return the additional QA metrics:

- `"f1_has_answer"`
- `"exact_match_has_answer"`
- `"sas_has_answer"`
- `"exact_match_no_answer"`

(We can probably leave out `"f1_no_answer"` and `"sas_no_answer"`, as they would be identical to `"exact_match_no_answer"`.)

The naming of the metrics could of course be different. This could be a quick win: the dataframe already contains all the data we need, and the logic for calculating these metrics already exists. We probably just need to filter the dataframe differently before calculating them.
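To illustrate the filtering idea, here is a minimal sketch of splitting an eval dataframe into "no answer" and "has answer" subsets before computing exact match. The column names (`gold_answers`, `answer`) and the convention that an empty string encodes "no answer" are assumptions for illustration, not Haystack's actual schema:

```python
import pandas as pd


def split_answer_metrics(df: pd.DataFrame) -> dict:
    """Compute exact match separately for "no answer" and regular labels.

    Assumed (hypothetical) columns:
      - "gold_answers": list of gold answer strings; [] or [""] means "no answer"
      - "answer": the predicted answer string; "" means the model predicted no answer
    """
    # Rows where every gold answer is empty count as "no answer" labels.
    no_answer_mask = df["gold_answers"].apply(
        lambda golds: all(g == "" for g in golds)
    )

    def exact_match(row) -> float:
        # Treat an empty gold list as a single empty-string gold answer,
        # so a predicted "" correctly scores 1.0 on "no answer" rows.
        golds = row["gold_answers"] or [""]
        return float(any(row["answer"] == g for g in golds))

    em = df.apply(exact_match, axis=1)
    return {
        "exact_match_has_answer": em[~no_answer_mask].mean(),
        "exact_match_no_answer": em[no_answer_mask].mean(),
    }
```

The same mask could be reused for F1 and SAS on the "has answer" subset; on the "no answer" subset they collapse to exact match, which is why the extra metrics above can be omitted.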

tholor avatar Sep 21 '22 07:09 tholor