Integrate `scikit-learn` metrics into `evaluate`
Summary
We want to support the wide range of metrics implemented in scikit-learn within evaluate. Besides expanding the capabilities of evaluate, this gives users from the scikit-learn ecosystem access to useful tools in evaluate, such as pushing results to the Hub or evaluating models end-to-end with the evaluator classes. As a bonus, every metric gets an interactive widget that can be embedded in various places such as the docs.
Goal
The goal of this integration should be that metrics from scikit-learn can be loaded from evaluate with the following API:
import evaluate
metric = evaluate.load("sklearn/accuracy")
metric.compute(predictions=[0, 1, 1], references=[1, 1, 0])
How it can be done
For the integration we could build a script that iterates over all metrics in the scikit-learn package, automatically builds the corresponding metric repositories in the evaluate format, and pushes them to the Hub. The script could be executed via a GitHub Action whenever a change is pushed to main, similar to how it's done for the internal modules (see here).
Besides the function, its arguments, and its input/output format, we can also use the docstrings to populate the Gradio widget on the Hub. See the Accuracy module as an example of how the metrics could be displayed. A rough sketch of what such a generation script could look like is shown below.
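To make this more concrete, here is a rough, untested sketch of what a generation script could do: enumerate the *_score functions, extract their docstrings and signatures, and write a stub repository per metric into a local output directory. The output path and the README template are placeholder assumptions for illustration; rendering the actual evaluate module file and pushing the folders to the Hub are left out of the sketch.

import pathlib
from inspect import getmembers, isfunction, getdoc, signature

from sklearn import metrics

OUTPUT_DIR = pathlib.Path("generated_modules")  # hypothetical output location

def generate_metric_repos():
    # Enumerate all public *_score functions exposed by sklearn.metrics.
    score_functions = [
        (name, func)
        for name, func in getmembers(metrics, isfunction)
        if name.endswith("_score")
    ]
    for name, func in score_functions:
        repo_dir = OUTPUT_DIR / name.removesuffix("_score")
        repo_dir.mkdir(parents=True, exist_ok=True)
        # The docstring and signature could populate the README and the widget description.
        (repo_dir / "README.md").write_text(
            f"# {name}\n\nSignature: `{signature(func)}`\n\n{getdoc(func)}\n"
        )
        # A real script would also render an evaluate module file here and push
        # each folder to the Hub; both steps are omitted in this sketch.

if __name__ == "__main__":
    generate_metric_repos()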
Can anyone work on this?
I'll be more than happy to work on this new integration.
#self-assign
I have some ideas on how this could be done and I'll start drafting a PR hopefully this week and I can tag you on it @Mouhanedg56!
Okay, sounds good. I tried listing the scikit-learn metrics using the inspect module and it works very well:
from inspect import getmembers, isfunction
from sklearn import metrics
print([func for func in getmembers(metrics, isfunction) if func[0].endswith("score")])
Output
[('accuracy_score', <function accuracy_score at 0x13f2c5b80>),
('adjusted_mutual_info_score', <function adjusted_mutual_info_score at 0x13f2dcdc0>),
('adjusted_rand_score', <function adjusted_rand_score at 0x13f2dc790>),
('average_precision_score', <function average_precision_score at 0x13f2bf4c0>),
('balanced_accuracy_score', <function balanced_accuracy_score at 0x13f2d4af0>),
('calinski_harabasz_score', <function calinski_harabasz_score at 0x13f9294c0>),
('cohen_kappa_score', <function cohen_kappa_score at 0x13f2c5ee0>),
('completeness_score', <function completeness_score at 0x13f2dc9d0>),
('consensus_score', <function consensus_score at 0x13f929a60>),
('davies_bouldin_score', <function davies_bouldin_score at 0x13f9295e0>),
('dcg_score', <function dcg_score at 0x13f2c50d0>),
('explained_variance_score', <function explained_variance_score at 0x13f93e040>),
('f1_score', <function f1_score at 0x13f2d43a0>),
('fbeta_score', <function fbeta_score at 0x13f2d44c0>),
('fowlkes_mallows_score', <function fowlkes_mallows_score at 0x13f2e8040>),
('homogeneity_score', <function homogeneity_score at 0x13f2dc8b0>),
('jaccard_score', <function jaccard_score at 0x13f2d4040>),
('label_ranking_average_precision_score', <function label_ranking_average_precision_score at 0x13f2bfb80>),
('mutual_info_score', <function mutual_info_score at 0x13f2dcca0>),
('ndcg_score', <function ndcg_score at 0x13f2c5280>),
('normalized_mutual_info_score', <function normalized_mutual_info_score at 0x13f2dcee0>),
('precision_score', <function precision_score at 0x13f2d48b0>),
('r2_score', <function r2_score at 0x13f93e160>),
('rand_score', <function rand_score at 0x13f2dc700>),
('recall_score', <function recall_score at 0x13f2d49d0>),
('roc_auc_score', <function roc_auc_score at 0x13f2bf700>),
('silhouette_score', <function silhouette_score at 0x13f9293a0>),
('top_k_accuracy_score', <function top_k_accuracy_score at 0x13f2c51f0>),
('v_measure_score', <function v_measure_score at 0x13f2dcb80>)]
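As a rough illustration (plain Python, not evaluate's actual module machinery), a function discovered this way could be wrapped behind the proposed predictions/references interface like this. Note that most, but not all, scikit-learn metrics expect (y_true, y_pred) in that order, so the mapping below is an assumption that would need per-metric handling:

from sklearn import metrics

def make_compute_fn(score_fn):
    # Wrap a sklearn metric so it can be called with the
    # predictions/references naming used in the proposed API.
    def compute(*, predictions, references, **kwargs):
        return {score_fn.__name__: score_fn(references, predictions, **kwargs)}
    return compute

compute_accuracy = make_compute_fn(metrics.accuracy_score)
print(compute_accuracy(predictions=[0, 1, 1], references=[1, 1, 0]))
# {'accuracy_score': 0.3333333333333333}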
@lvwerra, I would like to contribute as well. Is it possible?
@Mouhanedg56 The approach with the inspect module looks promising.
func[0].endswith("score")
Is it certain that the metrics will always end with score (including new metrics in the future)? Or is this just a prototype and you plan something more robust for the final implementation?
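For what it's worth, scikit-learn also exposes its scorer registry, which might be a more robust enumeration path than relying on the naming convention. Keep in mind that scorers wrap a metric together with the estimator call convention (they take an estimator, X and y), so they are not a drop-in replacement for the raw metric functions:

from sklearn.metrics import get_scorer, get_scorer_names

# get_scorer_names() lists the string names of all built-in scorers
# (available in scikit-learn >= 1.0); get_scorer() turns a name into
# a callable scorer object that wraps the underlying metric function.
print(get_scorer_names()[:5])
print(get_scorer("accuracy"))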
Any progress on this? Has it been accomplished?
It's WIP in #383 but evaluate is currently in low-maintenance mode.
Thanks. Would this be the correct way of using scikit-learn directly in order to avoid evaluate? It seems to be working...
import numpy as np
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, cohen_kappa_score, jaccard_score
from transformers import Trainer

zd = np.nan  # zero_division=np.nan requires scikit-learn >= 1.3
aver = "micro"

def compute_metrics(pred):
    # pred is the EvalPrediction passed by the Trainer: label ids plus logits.
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision = precision_score(labels, preds, average=aver, zero_division=zd)
    recall = recall_score(labels, preds, average=aver, zero_division=zd)
    f1 = f1_score(labels, preds, average=aver, zero_division=zd)
    accuracy = balanced_accuracy_score(labels, preds)
    mcc = matthews_corrcoef(labels, preds)
    ckc = cohen_kappa_score(labels, preds)
    jaccard = jaccard_score(labels, preds, average=aver, zero_division=0)
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'mcc': mcc,
        'ckc': ckc,
        'jaccard': jaccard
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=raw_datasets["train"] if training_args.do_train else None,
    eval_dataset=raw_datasets["eval"] if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=feature_extractor,
)
And then pass
--metric_for_best_model f1
on the command line.
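For that flag to take effect, the name has to match one of the keys returned by compute_metrics (the Trainer prefixes them with eval_ internally, and metric_for_best_model accepts the name with or without that prefix), and it is usually combined with checkpoint selection flags. An illustrative invocation, where the script name and the choice of per-epoch strategies are just placeholders:

python run_classification.py \
    --do_train --do_eval \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model f1 \
    --greater_is_better True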