Add common test for classifiers reducing to less than two classes via sample weights during fit
Description
In issue https://github.com/scikit-learn/scikit-learn/issues/6433 (fixed by #10207), it was raised that y could be reduced to fewer than two classes. While fixing that, Andreas commented here https://github.com/scikit-learn/scikit-learn/pull/10207#pullrequestreview-82909184 that this could occur for some other classifiers as well. So the task is to add a common test in sklearn/metrics/tests/test_classification.py.
The test should ensure that the classifier either fits fine or raises an informative error message.
Hi, can I try to write a test for this issue?
sure. add it to estimator_checks.py
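A minimal sketch of what such a common check could look like (the function name and structure are hypothetical, not the actual estimator_checks API):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def check_classifier_one_class_via_sample_weight(estimator):
    # Hypothetical check: training data reduced to a single class by
    # zero sample weights should either fit cleanly or raise an
    # informative ValueError that mentions the class problem.
    rnd = np.random.RandomState(0)
    X = rnd.uniform(size=(10, 3))
    y = np.arange(10) % 2
    sample_weight = y  # every class-0 sample gets weight 0

    try:
        estimator.fit(X, y, sample_weight=sample_weight)
        return "fit ok"
    except ValueError as exc:
        assert "class" in str(exc).lower(), "error message not informative"
        return "informative error"


for est in [RandomForestClassifier(n_estimators=10, random_state=0), SVC()]:
    print(type(est).__name__, "->",
          check_classifier_one_class_via_sample_weight(est))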
I do not know if this is expected, but I noticed that SVC refuses to learn if only one class is present in y, yet it does learn if y is reduced to one class via sample_weight.
import numpy as np
from sklearn.svm import SVC
est = SVC()
rnd = np.random.RandomState(0)
X = rnd.uniform(size=(10, 3))
y = np.arange(10) % 2
sample_weight = y
est.fit(X, y, sample_weight=sample_weight)
import numpy as np
from sklearn.svm import SVC
est = SVC()
rnd = np.random.RandomState(0)
X = rnd.uniform(size=(10, 3))
y = np.ones(10)
est.fit(X, y)
this one throws a ValueError: The number of classes has to be greater than one; got 1
EDIT: that is the case for all these classifiers:
- CalibratedClassifierCV
- LinearSVC
- LogisticRegression
- LogisticRegressionCV
- Perceptron
- SGDClassifier
- SVC
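For what it's worth, a quick loop (a sketch; CalibratedClassifierCV and LogisticRegressionCV are left out for brevity) confirms that these estimators all raise a ValueError on a single-class y:
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC

rnd = np.random.RandomState(0)
X = rnd.uniform(size=(10, 3))
y = np.ones(10)  # a single class

for est in [SVC(), LinearSVC(), LogisticRegression(), Perceptron(),
            SGDClassifier()]:
    try:
        est.fit(X, y)
        outcome = "fit ok"
    except ValueError as exc:
        outcome = "ValueError: %s" % exc
    print(type(est).__name__, "->", outcome)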
I think we should raise a ValueError in the first case, just as we did in GradientBoostingClassifier (in the PR mentioned in the issue description).
In the first case, I think it isn't actually learning anything (in a theoretical sense).
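A sketch of the kind of validation I have in mind, modeled on what GradientBoostingClassifier does after that PR (the helper name here is made up): count only the classes that keep nonzero weight, and raise the same kind of ValueError as in the single-class case.
import numpy as np


def n_trimmed_classes(y, sample_weight):
    # Hypothetical helper: count the classes that still carry nonzero
    # weight after applying sample_weight.
    y = np.asarray(y)
    sample_weight = np.asarray(sample_weight)
    classes = np.unique(y[sample_weight > 0])
    if len(classes) < 2:
        raise ValueError(
            "y contains %d class after sample_weight trimmed classes "
            "with zero weights, while a minimum of 2 classes are "
            "required." % len(classes))
    return len(classes)


y = np.arange(10) % 2
print(n_trimmed_classes(y, np.ones(10)))  # 2: both classes keep weight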
A quick-and-dirty script to find which classifiers fail to fit with only one label, but not when a class is removed via sample_weight:
import numpy as np
from sklearn.utils.testing import all_estimators
from sklearn.utils.validation import has_fit_parameter

both_failed = []
only_sample_weight_failed = []
only_one_class_failed = []
both_ok = []

for name, classifier in all_estimators(include_meta_estimators=False,
                                       type_filter='classifier'):
    est = classifier()
    if has_fit_parameter(est, "sample_weight"):
        failed_one_class = False
        failed_sample_one_class = False
        rnd = np.random.RandomState(0)
        X = rnd.uniform(size=(10, 3))

        # fit on a single class
        y = np.ones(10)
        try:
            est.fit(X, y)
        except Exception:
            failed_one_class = True

        # fit on two classes, one of them zero-weighted
        y2 = np.arange(10) % 2
        sample_weight = y2
        try:
            est.fit(X, y2, sample_weight=sample_weight)
        except Exception:
            failed_sample_one_class = True

        if failed_one_class and failed_sample_one_class:
            both_failed.append(name)
        elif failed_sample_one_class:
            only_sample_weight_failed.append(name)
        elif failed_one_class:
            only_one_class_failed.append(name)
        else:
            both_ok.append(name)

print('both fit ok', both_ok)
print('both fit failed', both_failed)
print('fit failed on sample weight but ok on one class', only_sample_weight_failed)
print('fit ok on sample weight but failed on one class', only_one_class_failed)
I haven't tried to understand your code snippet, but I'd like to point out that some classifiers work fine with a single class (even after sample_weight trimming), like RandomForestClassifier (Andreas mentioned that).
Maybe that is not so important if we at least check that the predictions are what we expect. I would expect the classifiers to always output the remaining label, and this could be added to the test, just as is done for the one-label test without sample_weight.
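Such a prediction check could look like this (a sketch; RandomForestClassifier is used because it fits fine in this situation):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rnd = np.random.RandomState(0)
X = rnd.uniform(size=(10, 3))
X_test = rnd.uniform(size=(5, 3))
y = np.arange(10) % 2
sample_weight = y  # class 0 is trimmed away by zero weights

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
pred = clf.predict(X_test)
print(pred)

# the classifier should only ever output the remaining label
assert np.all(pred == 1)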
In the end, we have the following behaviour with the test on sample weight. Let P1 = a problem with only one label, and P2 = a problem with two labels reduced to one by using sample_weight.
- CalibratedClassifierCV, LinearSVC, LogisticRegression, LogisticRegressionCV, Perceptron, SGDClassifier and SVC will fail on P1, but they will compute on P2.
- If we add a test to check that the predictions only output the remaining label on P2, we then restrict the problem to LinearSVC, SVC, GaussianNB and ComplementNB.
- It seems that LinearSVC does not care about sample_weight (this is not the case with SVC with kernel='linear' in version 0.19.1):
from sklearn import svm
import numpy as np
lsvc = svm.LinearSVC()
X = np.random.RandomState(0).uniform(size=(10, 10))
X_test = np.random.RandomState(0).uniform(size=(10, 10))
y2 = np.arange(10) % 2
sample_weight = y2
lsvc.fit(X, y2)
print(lsvc.coef_)
print(lsvc.predict(X_test))
lsvc.fit(X, y2, sample_weight)
print(lsvc.coef_)
print(lsvc.predict(X_test))
- SVC is strange, as it outputs only one label, but not the one I was expecting:
from sklearn import svm
import numpy as np
svc = svm.SVC()
X = np.random.RandomState(0).uniform(size=(10, 10))
X_test = np.random.RandomState(0).uniform(size=(10, 10))
y2 = np.arange(10) % 2
sample_weight = y2
svc.fit(X, y2, sample_weight=sample_weight)
print(svc.predict(X_test))
- With GaussianNB, sample_weight has a strange behaviour because of a ZeroDivision warning when the sum of the sample weights of a class becomes 0:
from sklearn.naive_bayes import GaussianNB
import numpy as np
gnb = GaussianNB()
rnd = np.random.RandomState(3)
X = rnd.uniform(size=(10, 10))
X_test = rnd.uniform(size=(10, 10))
y2 = np.arange(10) % 2
sample_weight = np.arange(10) % 2
gnb.fit(X, y2, sample_weight=sample_weight)
print(gnb.predict(X_test))
- With ComplementNB, I don't understand how the weights are used in the code.
Should I try to fix these problems within this issue, or should I open new issues for the different classifiers?