Combined metrics ignore some arguments

mattjeffryes opened this issue on Feb 16, 2023 · 8 comments

Using evaluate.combine, some kwargs seem not to get passed to the sub-metrics, resulting in incorrect outputs.

Using the examples from the precision metric:

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))

Out:

{'precision': 0.6666666666666666}
{'precision': 0.5}

0.666... is the correct answer; the combined metric appears to fall back to the default pos_label=1, which gives 0.5.

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))

Out:

{'precision': 0.23529411764705882}
{'precision': 0.5}

0.235... is the correct answer; again the combined metric returns the plain unweighted result of 0.5, consistent with sample_weight being dropped.

This issue occurred with all metrics I tried (precision, recall and F1).

Perhaps I am using the function incorrectly, but if so, this behaviour was very surprising to me.

macOS 13.2 on Apple M1, Python 3.10.9, evaluate 0.4.0

mattjeffryes · Feb 16 '23

You are right; it seems the keyword arguments are overridden in evaluate.combine. I'll send a PR.
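
In the meantime, a possible workaround (just a sketch with a hypothetical helper, not part of the evaluate API) is to load the metrics individually and forward the keyword arguments yourself:

import evaluate

def compute_combined(metric_names, **kwargs):
    # Hypothetical helper (not part of evaluate): compute each metric
    # separately so that kwargs such as pos_label or sample_weight
    # actually reach the underlying metric.
    results = {}
    for name in metric_names:
        results.update(evaluate.load(name).compute(**kwargs))
    return results

print(compute_combined(
    ['precision'],
    references=[0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0],
    pos_label=0,
))
# -> {'precision': 0.666...}, matching the standalone metric above

Note that this forwards the same kwargs to every metric, so it only helps when all of the combined metrics accept them.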

Plutone11011 · Feb 18 '23

Hi! I think there are also a few more use cases where the combine method doesn't process arguments correctly, mainly around the average argument. For example:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine(['precision', 'recall'])
metrics.compute(predictions, references, average='micro')

Or even when the metric is initialized with the average argument:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine([evaluate.load('precision', average='micro'), evaluate.load('recall', average='micro')])
metrics.compute(predictions, references)

Both cases result in a ValueError:

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
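
For reference, the same arguments do work when a single metric is loaded on its own (a quick sanity check):

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

# A single metric accepts average directly; only the combined module fails.
precision = evaluate.load('precision')
print(precision.compute(predictions=predictions, references=references, average='micro'))
# -> {'precision': 0.333...} (micro precision equals accuracy here: 2 of 6 match)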

bvezilic · Feb 28 '23

The PR is merged; can you check whether it works now?

lvwerra · Mar 14 '23

I can confirm that my #1 case with the average keyword now works correctly.

Interestingly, since I was using a fresh Python environment, I noticed that evaluate doesn't declare scikit-learn as a dependency, even though it is necessary to run metrics such as precision or recall. Maybe that's intended behaviour to keep dependencies minimal, but I just wanted to mention it.

My #2 case still doesn't work; it might be due to the precedence of which arguments are used. But I believe that's a separate issue from this one.

bvezilic · Mar 15 '23

I still have the same issue as @bvezilic. Is there a workaround?

Ioannis-Pikoulis · Aug 21 '23

Same here. This is problematic when different metrics require different arguments, like this:

metrics = evaluate.combine(
    [
        evaluate.load("bertscore", lang="en"),
        evaluate.load("bleu"),
        evaluate.load('rouge', use_aggregator=False)
    ]
)

results = metrics.compute(predictions=predictions, references=groundtruths)

This raises ValueError: Either 'lang' (e.g. 'en') or 'model_type' (e.g. 'microsoft/deberta-xlarge-mnli') must be specified.

If lang is passed to metrics.compute instead, then other metrics such as bleu throw TypeError: _compute() got an unexpected keyword argument 'lang', because the parameter is not meant for them.
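
A possible stopgap (just a sketch, not the combine API; it assumes the underlying metric dependencies such as bert-score and rouge-score are installed) is to compute each metric separately with its own keyword arguments and merge the results under namespaced keys:

import evaluate

predictions = ["hello there general kenobi"]
references = ["hello there general kenobi"]

# Per-metric kwargs, applied only to the metric that understands them.
metric_specs = [
    ("bertscore", {"lang": "en"}),
    ("bleu", {}),
    ("rouge", {"use_aggregator": False}),
]

results = {}
for name, extra_kwargs in metric_specs:
    module = evaluate.load(name)
    out = module.compute(predictions=predictions, references=references, **extra_kwargs)
    # Prefix keys with the metric name to avoid collisions between metrics
    # that return the same key (e.g. 'precision').
    results.update({f"{name}_{key}": value for key, value in out.items()})

print(results)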

santiagxf · Nov 08 '23

A related Stack Overflow solution adds the average parameter by using the METRIC_KWARGS class attribute of Evaluator.
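
Roughly, that suggestion looks like the following (an untested sketch; the exact behaviour of METRIC_KWARGS may depend on the evaluate version, and the model/dataset here are only placeholders):

import evaluate
from datasets import load_dataset
from transformers import pipeline

# Assumed usage: whatever is set in METRIC_KWARGS is forwarded to the
# metric's compute() call, so average can reach precision/recall/F1
# without going through combine().
task_evaluator = evaluate.evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "macro"}

results = task_evaluator.compute(
    model_or_pipeline=pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    ),
    data=load_dataset("imdb", split="test[:100]"),
    metric=evaluate.load("f1"),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)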

dvdblk · Jun 13 '24