calibrate() doesn't work if the corpus consists of just 2 files
If we have train_corpus in just two files (author1_-_title.txt, author2_-_title.txt), then calibrate(train_corpus) raises an error.
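For context, here is a minimal sketch of how such a corpus is typically built, following the library's README workflow (the file names, author names and titles are placeholders; `Corpus`, `add_book`, `tokenise` and `tokenise_remove_pronouns_en` are the library's documented helpers as far as I know, while `calibrate` is confirmed by the traceback below):

```python
from faststylometry import Corpus
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.probability import calibrate

# One book per author, read from the two files (placeholder names).
train_corpus = Corpus()
with open("author1_-_title.txt") as f:
    train_corpus.add_book("author1", "title1", f.read())
with open("author2_-_title.txt") as f:
    train_corpus.add_book("author2", "title2", f.read())

# Tokenise before calibrating, as in the README.
train_corpus.tokenise(tokenise_remove_pronouns_en)

calibrate(train_corpus)  # raises the ValueError shown below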
```
calibrate(train_corpus)

/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:265: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:257: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-cc70d17a9b30> in <cell line: 1>()
----> 1 calibrate(train_corpus)

5 frames
/usr/local/lib/python3.10/dist-packages/faststylometry/probability.py in calibrate(corpus, model)
     77     ground_truths, delta_values = get_calibration_curve(corpus)
     78
---> 79     model.fit(np.reshape(delta_values, (-1, 1)), ground_truths)
     80
     81     corpus.probability_model = model

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
   1194         _dtype = [np.float64, np.float32]
   1195
-> 1196         X, y = self._validate_data(
   1197             X,
   1198             y,

/usr/local/lib/python3.10/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    582             y = check_array(y, input_name="y", **check_y_params)
    583         else:
--> 584             X, y = check_X_y(X, y, **check_params)
    585         out = X, y
    586

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1104     )
   1105
-> 1106     X = check_array(
   1107         X,
   1108         accept_sparse=accept_sparse,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    919
    920     if force_all_finite:
--> 921         _assert_all_finite(
    922             array,
    923             input_name=input_name,

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    159                 "#estimators-that-handle-nan-values"
    160             )
--> 161             raise ValueError(msg_err)
    162
    163

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
```
Hi @samvelkoch, thank you for the feedback and apologies for the late reply. The reason this is not possible is that, in order to calibrate Burrows' Delta values into probabilities, we need:
- at least one example of two documents by the same author (a "positive example")
- at least one example of two documents by different authors (a "negative example")
Ideally you would have several positive and several negative examples.
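For example, the smallest corpus that satisfies both conditions has three documents: two by one author (giving a positive pair) and one by a second author (giving negative pairs against each of the first two). A hedged sketch of such a corpus, reusing the hypothetical file names and the `add_book` workflow from the sketch above:

```python
from faststylometry import Corpus

corpus = Corpus()

# Two books by the same author -> at least one positive example.
corpus.add_book("author1", "book_a", open("author1_-_book_a.txt").read())
corpus.add_book("author1", "book_b", open("author1_-_book_b.txt").read())

# One book by a different author -> negative examples against both books above.
corpus.add_book("author2", "book_c", open("author2_-_book_c.txt").read())
```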
These conditions cannot be met with only two documents, because the calibration needs to know how similar documents by the same author look in terms of Burrows' Delta; in practice, the internal calibration run on such a corpus produces NaN delta values, which is what LogisticRegression rejects in the traceback above. I suppose in some sense it would be possible to ship a default calibration that always translates delta = 1 to probability = 0.5, and so on, but that would bake in a bias from whichever dataset it was derived from.
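If someone did want such a default, one speculative way to do it would be to fit a logistic curve to hand-picked anchor points and attach it as the corpus's probability model (the `probability_model` attribute appears in the traceback above); the anchor values here are invented, which is exactly the dataset-bias concern:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented anchor points: small deltas labelled "same author" (1),
# large deltas labelled "different author" (0). Chosen so the fitted
# curve crosses probability 0.5 near delta = 1.
deltas = np.array([[0.0], [0.5], [1.5], [2.0]])
same_author = np.array([1, 1, 0, 0])

default_model = LogisticRegression()
default_model.fit(deltas, same_author)

# Attach the hand-made model in place of a properly calibrated one.
train_corpus.probability_model = default_model
```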
You could also reuse the calibration obtained from the example documents provided with the library, as sketched below.
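Concretely, something along these lines should work (a sketch based on the README workflow: `load_corpus_from_folder`, `calculate_burrows_delta` and `predict_proba` are the library's documented helpers as far as I recall, and both folder paths are placeholders):

```python
from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta
from faststylometry.probability import calibrate, predict_proba

# Calibrate on the multi-author example corpus that ships with the library.
reference_corpus = load_corpus_from_folder("faststylometry/data/train")  # placeholder path
reference_corpus.tokenise(tokenise_remove_pronouns_en)
calibrate(reference_corpus)

# Score the two-document corpus against the calibrated reference corpus.
test_corpus = load_corpus_from_folder("path/to/your/two_files")  # placeholder path
test_corpus.tokenise(tokenise_remove_pronouns_en)
calculate_burrows_delta(reference_corpus, test_corpus)
print(predict_proba(reference_corpus, test_corpus))
```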
Perhaps someone can implement such a feature? I will leave it to the community, but if I get time I will investigate it myself.