calibrate() doesn't work if the corpus is just in 2 files

Open samvelkoch opened this issue 2 years ago • 1 comments

If we have train_corpus in 2 files (author1_-title.txt, author2-_title.txt) than calibrate(train_corpus) will drop an error:

calibrate(train_corpus)

lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-26-cc70d17a9b30>](https://localhost:8080/#) in <cell line: 1>()
----> 1 calibrate(train_corpus)

5 frames
[/usr/local/lib/python3.10/dist-packages/faststylometry/probability.py](https://localhost:8080/#) in calibrate(corpus, model)
     77     ground_truths, delta_values = get_calibration_curve(corpus)
     78 
---> 79     model.fit(np.reshape(delta_values, (-1, 1)), ground_truths)
     80 
     81     corpus.probability_model = model

[/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py](https://localhost:8080/#) in fit(self, X, y, sample_weight)
   1194             _dtype = [np.float64, np.float32]
   1195 
-> 1196         X, y = self._validate_data(
   1197             X,
   1198             y,

[/usr/local/lib/python3.10/dist-packages/sklearn/base.py](https://localhost:8080/#) in _validate_data(self, X, y, reset, validate_separately, **check_params)
    582                 y = check_array(y, input_name="y", **check_y_params)
    583             else:
--> 584                 X, y = check_X_y(X, y, **check_params)
    585             out = X, y
    586 

[/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py](https://localhost:8080/#) in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1104         )
   1105 
-> 1106     X = check_array(
   1107         X,
   1108         accept_sparse=accept_sparse,

[/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py](https://localhost:8080/#) in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    919 
    920         if force_all_finite:
--> 921             _assert_all_finite(
    922                 array,
    923                 input_name=input_name,

[/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py](https://localhost:8080/#) in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159                 "#estimators-that-handle-nan-values"
    160             )
--> 161         raise ValueError(msg_err)
    162 
    163 

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Nov 26 '23 20:11 samvelkoch

Hi @samvelkoch, Thank you for the feedback and apologies for the late reply. The reason why this is not possible is that in order to calibrate Burrows Delta versions into probabilities we need:

at least one example of two documents by the same author (a "positive example")
at least one example of two documents by different authors (a "negative example")

Ideally you would have several positive and several negative examples.

These conditions cannot be met with only two documents. Because the calibration needs to know how similar documents by the same author will look in Burrows Delta. I suppose in some sense it would be possible to make a default calibration so that we always translate delta = 1 to probability = 0.5, etc, but this would be taking on a bias from a particular dataset.

You could also use the calibration that comes out of the example docs provided with the library.

Perhaps someone can implement such a feature? I will leave it to the community but if I get time I will investigate

Jul 21 '25 08:07 woodthom2