SklearnQueryClassifier
**Describe the bug**
When using the SklearnQueryClassifier in an attempt to classify sentences, I get an error.
**Error message**

```
AttributeError: 'GradientBoostingClassifier' object has no attribute '_loss'.
```
**Expected behavior**
I would expect to be able to load and run inference with the query classifier without issue.
**To Reproduce**

```python
!pip install -q farm-haystack==1.6.0

from haystack.pipeline import SklearnQueryClassifier

gbt_clf = SklearnQueryClassifier()
gbt_clf.run(query="Is this a question?")
```
**FAQ Check**
- [X] Have you had a look at our new FAQ page?
**System:**
- OS: Mac
- GPU/CPU: CPU
- Haystack version (commit or version number): 1.6.0
- DocumentStore: N/A
- Reader: N/A
- Retriever: N/A
Hi @jstremme, this seems to happen with scikit-learn version 1.1.1. If you downgrade with `pip install scikit-learn==1.0.2`, the classifier should work.
We will look into what needs to be updated to make the classifier work with the newest version of scikit-learn.
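Until then, a quick way to tell whether an environment is affected is to compare the installed scikit-learn version against 1.1 before loading the classifier. This is a stdlib-only sketch; the `is_affected` helper is hypothetical and not part of Haystack:

```python
# Hypothetical helper (not part of Haystack): parse a scikit-learn version
# string and report whether the pickled classifier is known to break on it.
def is_affected(sklearn_version: str) -> bool:
    major, minor = (int(p) for p in sklearn_version.split(".")[:2])
    # Pickles trained on scikit-learn 1.0.x raise the AttributeError above
    # when loaded under 1.1 or newer.
    return (major, minor) >= (1, 1)

print(is_affected("1.1.1"))  # the version from this report
print(is_affected("1.0.2"))  # the suggested downgrade
```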
Thanks @sjrl, this is a helpful workaround.
@sjrl @ZanSara I believe that, to solve this issue, we can retrain the SklearnQueryClassifier model using the new version of scikit-learn.
This is probably not a permanent solution but I have no better ideas. WDYT?
The original data and the training procedure can be found here: #611
Hello @anakin87! The issue you linked is super-long :smile: I have the impression that re-training the model might be a viable solution. I'm surprised it's needed, but I trust your judgement here.
Let us know if it's viable on a consumer machine or if you need additional computing resources.
In the long run, I see these two possible actions:

- retrain the (2!) scikit-learn models whenever a new release breaks things. See "Security & maintainability limitations" in the scikit-learn docs. (To do this, we should find the exact versions of the dataset and training code, probably hidden in #611.)
- understand whether the skops project can help us: for the moment, it allows pushing a scikit-learn model to the HF Hub and provides a safer alternative to pickle. Let's see if they introduce something for better compatibility of models across different scikit-learn versions.
Meanwhile, I'm preparing a monkey patch to solve the problem for now.
@jstremme We just merged a PR that patches the issue, thanks to @anakin87. Please let us know if the issue persists. Thank you!
After a long and pleasant :smiley: read of #611, here is the main information.
**Keywords vs. Questions/Statements (default)**
- Dataset: Quora question keyword pairs https://www.kaggle.com/stefanondisponibile/quora-question-keyword-pairs
- Training Notebook: https://www.kaggle.com/shahrukhkhan/question-v-statement-detection?scriptVersionId=63376602
**Questions vs. Statements**
- Dataset: SPAADIA v2
- Raw: http://martinweisser.org/amex_a_corpus.zip
- Parsed: https://www.kaggle.com/shahrukhkhan/questions-vs-statementsclassificationdataset
- Training Notebook: https://www.kaggle.com/code/shahrukhkhan/questions-v-statement-gradient-boosting-classifier
When I have some time, I will try to retrain the models with the latest version of scikit-learn...
I trained the new models in the following notebooks:
Here you can find models and vectorizers: queryclassifier.zip
**Training details**
- Colab instead of Kaggle, to use Python 3.8 (and the latest scikit-learn version)
- scikit-learn==1.2.0
- pickle protocol 4: compatible with Python >= 3.4
- exactly the same results as the original models
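As a rough illustration of the retraining flow (not the actual training notebooks), here is a minimal sketch that fits a TfidfVectorizer plus GradientBoostingClassifier and pickles both with protocol 4; the toy texts and labels are invented for the example:

```python
import pickle

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the question-vs-keyword training data (invented here).
texts = [
    "what is haystack used for",
    "how do pipelines work",
    "haystack installation guide",
    "query classifier tutorial",
]
labels = [1, 1, 0, 0]  # 1 = natural-language question, 0 = keyword query

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
model = GradientBoostingClassifier(random_state=0).fit(features, labels)

# Protocol 4 keeps the pickles loadable on Python >= 3.4.
model_bytes = pickle.dumps(model, protocol=4)
vectorizer_bytes = pickle.dumps(vectorizer, protocol=4)
```

In the real notebooks, both pickles are then uploaded so the classifier can fetch them by URL.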
**Compatibility with scikit-learn versions**
The trained models are:
- (obviously) compatible with 1.2.0
- compatible with >=1.0.0 if we slightly improve the current monkey patch as follows:

```python
self.model = pickle.load(urllib.request.urlopen(model_name_or_path))

# MONKEY PATCH to support different versions of scikit-learn
# see https://github.com/deepset-ai/haystack/issues/2904
# (BinomialDeviance comes from sklearn.ensemble._gb_losses in these versions)
if isinstance(self.model, GradientBoostingClassifier):
    if not hasattr(self.model, "_loss"):
        self.model._loss = BinomialDeviance(2)
    if not hasattr(self.model, "loss_"):
        self.model.loss_ = BinomialDeviance(2)
```
- incompatible with <1.0.0
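To illustrate the general shape of this patch without depending on scikit-learn, here is a stdlib-only sketch. `DummyLoss` is a stand-in for `BinomialDeviance(2)`, and mirroring the attribute that already exists (rather than constructing a fresh loss object, as the real patch does) is a simplification for the example:

```python
from types import SimpleNamespace

class DummyLoss:
    """Stand-in for sklearn's BinomialDeviance(2) in this sketch."""

# Pretend this object was unpickled from a model trained on an older
# scikit-learn, so it only carries the old `loss_` attribute.
model = SimpleNamespace(loss_=DummyLoss())

# Mirror whichever loss attribute exists onto the name the running
# scikit-learn version expects, so both old and new code paths work.
for present, missing in (("loss_", "_loss"), ("_loss", "loss_")):
    if hasattr(model, present) and not hasattr(model, missing):
        setattr(model, missing, getattr(model, present))
```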
Compatibility with scikit-learn is not a trivial issue in itself, and we can expect more changes in the future. For this reason, as mentioned, I would keep a close eye on the evolution of the skops project, which promises to alleviate some of these problems.
In any case, I am available for further clarification. If we decide to improve the monkey patch, I can do it! @julian-risch WDYT?
@anakin87 Sounds very good to me. If you want, I can take care of uploading the models to the same S3 bucket as the old models. You can assign me as a reviewer for your PR (improving the current monkey patch).
@julian-risch please upload the models to the S3 bucket.
I'm preparing the PR...
Let's close this issue now that https://github.com/deepset-ai/haystack/pull/3732 is merged. In the future, with new scikit-learn versions, we might need to check this again.