SklearnQueryClassifier

Open jstremme opened this issue 3 years ago • 2 comments

Describe the bug

When using the SklearnQueryClassifier to classify sentences, I get an error.

Error message

AttributeError: 'GradientBoostingClassifier' object has no attribute '_loss'.

Expected behavior

I would expect to be able to load the query classifier and run inference without issue.

To Reproduce

!pip install -q farm-haystack==1.6.0
from haystack.pipeline import SklearnQueryClassifier
gbt_clf = SklearnQueryClassifier()
gbt_clf.run(query="Is this a question?")

FAQ Check

System:

  • OS: Mac
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.6.0
  • DocumentStore: N/A
  • Reader: N/A
  • Retriever: N/A

jstremme, Jul 28 '22 02:07

Hi @jstremme, this appears to happen with scikit-learn version 1.1.1. If you downgrade with pip install scikit-learn==1.0.2, the classifier should work.

We will look into what needs to be updated to get the newest version of scikit-learn to work.
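
To tell whether an environment falls on the affected side of this workaround, a small version check can help. The following is only a sketch: the helper names (parse_version, installed_sklearn_version, newer_than_known_good) are made up for illustration and are not part of Haystack or scikit-learn, and the 1.0.2 threshold is simply the version reported as working above.

```python
from importlib import metadata

# Version reported as working in this thread (assumption: anything newer may break).
KNOWN_GOOD = (1, 0, 2)


def parse_version(text):
    """Turn '1.1.1' into (1, 1, 1); trailing suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_sklearn_version():
    """Return the installed scikit-learn version as a tuple, or None if it is not installed."""
    try:
        return parse_version(metadata.version("scikit-learn"))
    except metadata.PackageNotFoundError:
        return None


def newer_than_known_good(version):
    """True when the given version tuple is newer than the known-good one."""
    return version is not None and version > KNOWN_GOOD


if __name__ == "__main__":
    found = installed_sklearn_version()
    if newer_than_known_good(found):
        print("scikit-learn", found, "is newer than", KNOWN_GOOD, "and may hit the '_loss' error")
```

The check is purely informational; it does not change which scikit-learn version gets installed.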

sjrl, Jul 28 '22 08:07

Thanks @sjrl, this is a helpful workaround.

jstremme, Jul 28 '22 13:07

@sjrl @ZanSara I believe that, to solve this issue, we can retrain the SklearnQueryClassifier model using the new version of scikit-learn. This is probably not a permanent solution but I have no better ideas. WDYT?

The original data and the training procedure can be found here: #611

anakin87, Nov 04 '22 08:11

Hello @anakin87! The issue you linked is super-long :smile: I have the impression that re-training the model might be a viable solution. I'm surprised it's needed, but I trust your judgement here.

Let us know if it's viable on a consumer machine or if you need additional computing resources.

ZanSara, Dec 05 '22 10:12

In the long run, I see these two possible actions:

  • retrain the (2!) scikit-learn models whenever a new release breaks things. See "Security & maintainability limitations" in the scikit-learn docs. (To do this, we need to find the exact versions of the dataset and training code, which are probably buried in #611.)

  • understand if the skops project can help us: for the moment, they allow pushing a scikit-learn model to HF Hub and are providing a safer alternative to pickle. Let's see if they introduce something for better compatibility of models across different scikit-learn versions.

Meanwhile, I'm preparing a monkey patch to solve the problem for now.

anakin87, Dec 06 '22 22:12

@jstremme We just merged a PR that patches the issue thanks to @anakin87 . Please let us know if the issue remains, thank you.

julian-risch, Dec 07 '22 09:12

After a long and pleasant :smiley: read through #611, here is the key information.

Keywords vs Questions/Statements (Default)

  • Dataset: Quora question keyword pairs https://www.kaggle.com/stefanondisponibile/quora-question-keyword-pairs
  • Training Notebook: https://www.kaggle.com/shahrukhkhan/question-v-statement-detection?scriptVersionId=63376602

Questions vs. Statements

  • Dataset: SPAADIA v2
    • Raw: http://martinweisser.org/amex_a_corpus.zip
    • Parsed: https://www.kaggle.com/shahrukhkhan/questions-vs-statementsclassificationdataset
  • Training Notebook: https://www.kaggle.com/code/shahrukhkhan/questions-v-statement-gradient-boosting-classifier

When I have some time, I will try to retrain the models with the latest version of scikit-learn...

anakin87, Dec 14 '22 21:12

I trained the new models in the following notebooks:

Here you can find models and vectorizers: queryclassifier.zip

Training details

  • Colab instead of Kaggle to use python3.8 (and latest scikit-learn version)
  • scikit-learn==1.2.0
  • pickle protocol 4: compatible with python>=3.4
  • same exact results as the original models
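
For readers unsure what the pickle protocol bullet means in practice, here is a minimal standard-library sketch. The payload dict is a made-up stand-in for a trained model, not the actual classifier; the sketch only shows that protocol 4 can be requested explicitly and verified from the first bytes of the stream.

```python
import pickle

# Hypothetical stand-in for a trained model; any picklable object behaves the same way.
payload = {"name": "query-classifier", "classes": [0, 1]}

# Pin protocol 4 explicitly instead of relying on the interpreter default,
# which differs between Python versions.
data = pickle.dumps(payload, protocol=4)

# Streams written with protocol 2 or later start with the PROTO opcode (0x80),
# followed by one byte holding the protocol number.
protocol_used = data[1]

# Loading does not need the protocol number; it is read from the stream.
restored = pickle.loads(data)
```

Pinning the protocol only controls the serialization format. It does not make a model pickled under one scikit-learn version loadable under another, which is the separate problem discussed in this thread.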

Compatibility with scikit-learn versions

The trained models are:

  • (obviously) compatible with 1.2.0
  • compatible with >=1.0.0 if we slightly improve the current monkey patch as follows:
        self.model = pickle.load(urllib.request.urlopen(model_name_or_path))
        # MONKEY PATCH to support different versions of scikit-learn
        # see https://github.com/deepset-ai/haystack/issues/2904
        if isinstance(self.model, GradientBoostingClassifier):
            if not hasattr(self.model, "_loss"):
                self.model._loss = BinomialDeviance(2)
            if not hasattr(self.model, "loss_"):
                self.model.loss_ = BinomialDeviance(2)
  • incompatible with <1.0.0
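
The pattern behind the patch above is "add an attribute only when the loaded object lacks it". Here is a self-contained sketch of that pattern, assuming nothing about scikit-learn itself: LegacyModel and ensure_attrs are hypothetical names invented for illustration, and plain strings stand in for the BinomialDeviance(2) objects used in the real patch.

```python
class LegacyModel:
    """Hypothetical stand-in for an unpickled estimator that only has the old attribute."""

    def __init__(self):
        self.loss_ = "legacy-loss"


def ensure_attrs(obj, defaults):
    """Set each attribute in `defaults` only if `obj` lacks it; return the names that were added.

    `defaults` maps attribute names to zero-argument factories, so a default
    value is only constructed when it is actually needed.
    """
    added = []
    for name, factory in defaults.items():
        if not hasattr(obj, name):
            setattr(obj, name, factory())
            added.append(name)
    return added


model = LegacyModel()
added = ensure_attrs(
    model,
    {
        "_loss": lambda: "binomial",  # missing on the stand-in, so it gets added
        "loss_": lambda: "binomial",  # already present, so it is left untouched
    },
)
```

Because existing attributes are never overwritten, the same call is harmless on an object that already has both names, which is what lets one code path cover several library versions.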

The issue of compatibility with scikit-learn is not trivial in itself and we can expect some changes in the future. For this reason, as mentioned, I would look carefully at the evolution of the skops project, which promises to alleviate some problems.

In any case, I am available for further clarification. If we decide to improve the monkey patch, I can do it! @julian-risch WDYT?

anakin87, Dec 14 '22 23:12

@anakin87 Sounds very good to me. If you want, I can take care of uploading the models to the same S3 bucket as the old models. You can assign me as a reviewer for your PR (improving the current monkey patch).

julian-risch, Dec 19 '22 15:12

@julian-risch please upload the models to the S3 bucket.

I'm preparing the PR...

anakin87, Dec 19 '22 20:12

Let's close this issue now that https://github.com/deepset-ai/haystack/pull/3732 is merged. In the future, new scikit-learn versions might require us to check this again.

julian-risch, Dec 20 '22 10:12