Combined metrics ignore some arguments

mattjeffryes opened this issue on Feb 16, 2023 · 8 comments

Using evaluate.combine, some kwargs seem not to get passed to the sub-metrics, resulting in incorrect outputs.

Using the examples from the precision metric:

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))

Out:

{'precision': 0.6666666666666666}
{'precision': 0.5}

0.666... is the correct answer; the combined metric appears to fall back to the default pos_label=1, which gives 0.5.

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))

Out:

{'precision': 0.23529411764705882}
{'precision': 0.5}

0.235... is the correct answer; again the combined metric returns the plain unweighted result of 0.5, consistent with sample_weight being dropped.

This issue occurred with all metrics I tried (precision, recall and F1).

Perhaps I am using the function incorrectly, but if so, this behaviour was very surprising to me.

macOS 13.2 on Apple M1, Python 3.10.9, evaluate 0.4.0

mattjeffryes · Feb 16 '23

You are right; it seems the keyword arguments are overridden in evaluate.combine. I'll send a PR.
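
In the meantime, a possible workaround (just a sketch with a hypothetical helper, not part of the evaluate API) is to load the metrics individually and forward the keyword arguments yourself:

import evaluate

def compute_combined(metric_names, **kwargs):
    # Hypothetical helper (not part of evaluate): compute each metric
    # separately so that kwargs such as pos_label or sample_weight
    # actually reach the underlying metric.
    results = {}
    for name in metric_names:
        results.update(evaluate.load(name).compute(**kwargs))
    return results

print(compute_combined(
    ['precision'],
    references=[0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0],
    pos_label=0,
))
# -> {'precision': 0.666...}, matching the standalone metric above

Note that this forwards the same kwargs to every metric, so it only helps when all of the combined metrics accept them.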

Plutone11011 · Feb 18 '23

Hi! I think there are also a few more use cases where the combine method doesn't process arguments correctly, mainly around the average argument. For example:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine(['precision', 'recall'])
metrics.compute(predictions, references, average='micro')

Or even when the metric is initialized with the average argument:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine([evaluate.load('precision', average='micro'), evaluate.load('recall', average='micro')])
metrics.compute(predictions, references)

Both cases result in a ValueError:

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
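
For reference, the same arguments do work when a single metric is loaded on its own (a quick sanity check):

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

# A single metric accepts average directly; only the combined module fails.
precision = evaluate.load('precision')
print(precision.compute(predictions=predictions, references=references, average='micro'))
# -> {'precision': 0.333...} (micro precision equals accuracy here: 2 of 6 match)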

bvezilic · Feb 28 '23

The PR is merged; can you check whether it works now?

lvwerra · Mar 14 '23

I can confirm that my #1 case with the average keyword now works correctly.

Interestingly, since I was using a fresh Python environment, I noticed that evaluate doesn't declare scikit-learn as a dependency, even though it is necessary to run metrics such as precision or recall. Maybe that's intended behaviour to keep dependencies minimal, but I just wanted to mention it.

My #2 case still doesn't work; it might be due to the precedence of which arguments are used. But I believe that's a separate issue from this one.

bvezilic · Mar 15 '23

I still have the same issue as @bvezilic. Is there a workaround?

Ioannis-Pikoulis · Aug 21 '23

Same here. This is problematic when different metrics require different arguments, like this:

metrics = evaluate.combine(
    [
        evaluate.load("bertscore", lang="en"),
        evaluate.load("bleu"),
        evaluate.load('rouge', use_aggregator=False)
    ]
)

results = metrics.compute(predictions=predictions, references=groundtruths)

This raises ValueError: Either 'lang' (e.g. 'en') or 'model_type' (e.g. 'microsoft/deberta-xlarge-mnli') must be specified.

If lang is passed to metrics.compute instead, then other metrics such as bleu throw TypeError: _compute() got an unexpected keyword argument 'lang', because the parameter is not meant for them.
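
A possible stopgap (just a sketch, not the combine API; it assumes the underlying metric dependencies such as bert-score and rouge-score are installed) is to compute each metric separately with its own keyword arguments and merge the results under namespaced keys:

import evaluate

predictions = ["hello there general kenobi"]
references = ["hello there general kenobi"]

# Per-metric kwargs, applied only to the metric that understands them.
metric_specs = [
    ("bertscore", {"lang": "en"}),
    ("bleu", {}),
    ("rouge", {"use_aggregator": False}),
]

results = {}
for name, extra_kwargs in metric_specs:
    module = evaluate.load(name)
    out = module.compute(predictions=predictions, references=references, **extra_kwargs)
    # Prefix keys with the metric name to avoid collisions between metrics
    # that return the same key (e.g. 'precision').
    results.update({f"{name}_{key}": value for key, value in out.items()})

print(results)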

santiagxf · Nov 08 '23

A related Stack Overflow solution adds the average parameter by using the METRIC_KWARGS class attribute of Evaluator.
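
Roughly, that suggestion looks like the following (an untested sketch; the exact behaviour of METRIC_KWARGS may depend on the evaluate version, and the model/dataset here are only placeholders):

import evaluate
from datasets import load_dataset
from transformers import pipeline

# Assumed usage: whatever is set in METRIC_KWARGS is forwarded to the
# metric's compute() call, so average can reach precision/recall/F1
# without going through combine().
task_evaluator = evaluate.evaluator("text-classification")
task_evaluator.METRIC_KWARGS = {"average": "macro"}

results = task_evaluator.compute(
    model_or_pipeline=pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    ),
    data=load_dataset("imdb", split="test[:100]"),
    metric=evaluate.load("f1"),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
print(results)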

dvdblk · Jun 13 '24