
Yeo-Johnson Power Transformer gives Numpy warning

Open nilslacroix opened this issue 3 years ago • 21 comments

Describe the bug

When I use a power transformer with yeo-johnson method I get this warning in numpy:

../lib/python3.10/site-packages/numpy/core/_methods.py:235: RuntimeWarning: overflow encountered in multiply

Strangely, I can't suppress this warning with `warnings.filterwarnings()`.

Tl;dr: The error seems to appear when multiple instances of the transformer are called, in this example because n_jobs=2 is set in the preprocessor. But the bug also appears with n_jobs=1 when the preprocessor is called in a grid search, for example.

Steps/Code to Reproduce

The following code snippet will raise a warning:

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(size=(50,))
X -= X.max()
X = -X
X += 100
X = X.reshape(-1, 1)

method = "yeo-johnson"
transformer = PowerTransformer(method=method, standardize=False)
transformer.fit_transform(X)

Expected Results

No Warnings

Actual Results

/Users/glemaitre/mambaforge/envs/dev/lib/python3.8/site-packages/numpy/core/_methods.py:233: RuntimeWarning: overflow encountered in multiply
  x = um.multiply(x, x, out=x)

Versions

System:
    python: 3.8.12 | packaged by conda-forge | (default, Sep 16 2021, 01:38:21)  [Clang 11.1.0 ]
executable: /Users/glemaitre/mambaforge/envs/dev/bin/python
   machine: macOS-12.3.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.2.dev0
          pip: 21.3
   setuptools: 58.2.0
        numpy: 1.21.6
        scipy: 1.8.0
       Cython: 0.29.24
       pandas: 1.4.2
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/glemaitre/mambaforge/envs/dev/lib/libopenblas_vortexp-r0.3.18.dylib
        version: 0.3.18
threading_layer: openmp
   architecture: VORTEX
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/glemaitre/mambaforge/envs/dev/lib/libomp.dylib
        version: None
    num_threads: 8

nilslacroix avatar May 10 '22 09:05 nilslacroix

Could you provide the data so that we can reproduce the use case and check what triggers this overflow?

glemaitre avatar May 10 '22 10:05 glemaitre

This is not a future-version warning. Could you provide the dummy data so that we can look into it? As far as we can tell, the function is working as intended.

AliHaider20 avatar May 10 '22 10:05 AliHaider20

I can provide this screenshot, if that helps:

[image: screenshot of a sample of the dataframe]

Also for the dtypes:

Nightlife            float64
Education            float64
Security             float64
Nature               float64
Eating               float64
Balcony                int64
DailyNeeds           float64
Tourism              float64
Leisure              float64
YearBuilt            float64
YearModernization    float64
dtype: object

I could also provide a small sample dataset if it is possible to upload it here?

nilslacroix avatar May 10 '22 11:05 nilslacroix

You can upload it here. Please provide the smallest sample that still triggers the warning.

glemaitre avatar May 10 '22 11:05 glemaitre

dataset.zip This is the minimum size at which I could reproduce the error. Strangely enough, it does not happen with smaller datasets, so it may be linked to sample size.

nilslacroix avatar May 10 '22 11:05 nilslacroix

The problem seems to be within column "YearModernization".

The warning does not appear when I use the power transformer outside of the pipeline:

So this works:

pt = PowerTransformer(method="yeo-johnson")
pt.fit(pd.DataFrame(X_train["YearModernization"]))

nilslacroix avatar May 10 '22 12:05 nilslacroix

But there is no n_jobs parameter in PowerTransformer, is there?

glemaitre avatar May 10 '22 12:05 glemaitre

Sorry the parameter is in the preprocessor pipeline.

nilslacroix avatar May 10 '22 12:05 nilslacroix

Also, if you call this preprocessor in a grid search, the same error appears even when the preprocessor stays at n_jobs=1. So my guess is that something breaks when multiple fits are called on this transformer, with cv=5 for example.

nilslacroix avatar May 10 '22 12:05 nilslacroix

Could you also provide the version of pandas used, so that we can unpickle the dataset?

glemaitre avatar May 10 '22 12:05 glemaitre

Basically the following:

In [1]: import sklearn
In [2]: sklearn.show_versions()

glemaitre avatar May 10 '22 12:05 glemaitre

System:
    python: 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:38:57) [GCC 10.3.0]
executable: /home/xxxxx/anaconda3/envs/MASTER/bin/python
   machine: Linux-4.15.0-154-generic-x86_64-with-glibc2.27

Python dependencies:
      sklearn: 1.2.dev0
          pip: 22.0.4
   setuptools: 62.1.0
        numpy: 1.22.3
        scipy: 1.8.0
       Cython: 0.29.28
       pandas: 1.4.2
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/xxx/anaconda3/envs/MASTER/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 56

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/xxx/anaconda3/envs/MASTER/lib/libopenblasp-r0.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: Haswell
    num_threads: 56

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/xxx/anaconda3/envs/MASTER/lib/libgomp.so.1.0.0
        version: None
    num_threads: 56

nilslacroix avatar May 10 '22 12:05 nilslacroix

This may explain why the warnings cannot be turned off with warnings.filterwarnings: the setting only applies to the main thread/process, but the warning is raised by processes/threads created in parallel.
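A minimal sketch of both effects (illustrative code, not from the thread): computing the variance of values around 1e206 squares them past the float64 limit, and a filter installed with `warnings.filterwarnings()` exists only in the process that set it, so parallel workers start with the default filters.

```python
import warnings

import numpy as np

# Deviations of the magnitude PowerTransformer produces here (~1e206);
# squaring them exceeds the float64 maximum (~1.8e308).
x = np.array([1e206, -1e206])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    v = x.var()  # np.var squares the deviations -> overflow in multiply

print(v)            # inf
print(len(caught))  # at least one RuntimeWarning was recorded

# Note: warnings.filterwarnings("ignore", ...) installs a filter only in
# the current process; joblib's worker processes start with the default
# filters, which is consistent with the behavior described above.
```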

nilslacroix avatar May 10 '22 12:05 nilslacroix

So the warning is raised by the following line:

https://github.com/scikit-learn/scikit-learn/blob/b0b8a39d8bb80611398e4c57895420d5cb1dfe09/sklearn/preprocessing/_data.py#L3244-L3246

At some point x_trans is equal to the following:

[1.50645674e+206 1.82014521e+206 2.06412322e+206 ... 3.00589049e+206
 2.41473577e+206 7.49229704e+205]

and computing the variance will overflow.

We should probably avoid computing the variance and directly return np.inf.

glemaitre avatar May 10 '22 13:05 glemaitre

The problem is indeed triggered by YearModernization. The distribution is the following:

[image: histogram of the YearModernization distribution]

Somehow, I feel that it could be linked to the following issue: https://github.com/scikit-learn/scikit-learn/issues/14959 The PR https://github.com/scikit-learn/scikit-learn/pull/20653 was maybe not enough to solve the problem.

ping @thomasjpfan @ogrisel @jeremiedbb, who had a look at the original issue.

glemaitre avatar May 10 '22 13:05 glemaitre

~~maybe we could make a "safe variance" function by using the fact that V(x) = xmax**2 V(x / xmax)?~~ EDIT: nvm, it will probably not be more stable...

jeremiedbb avatar May 10 '22 13:05 jeremiedbb

The square will overflow :)
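Indeed, an illustrative two-liner (not from the thread): for the magnitudes seen above (~1e206), xmax**2 alone already exceeds the float64 range.

```python
import numpy as np

xmax = np.float64(1e206)
with np.errstate(over="ignore"):  # silence the expected overflow warning
    print(xmax ** 2)  # inf: 1e412 is far beyond the float64 maximum (~1.8e308)
```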

glemaitre avatar May 10 '22 13:05 glemaitre

Can someone craft a minimal reproducer with a single-column dataset, possibly synthetically generated?

ogrisel avatar May 10 '22 13:05 ogrisel

But then this code would trigger the warning as well:


pt = PowerTransformer(method="yeo-johnson")
pt.fit(pd.DataFrame(X_train["YearModernization"]))

which it does not. Or am I wrong?

nilslacroix avatar May 10 '22 13:05 nilslacroix

Can someone craft a minimal reproducer with a single-column dataset, possibly synthetically generated?

Here is a minimal reproducer, but you have to use this dataset:

dataset.zip

from joblib import load
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer


data = load("X_train.joblib")
data = pd.DataFrame(data["YearModernization"])

pow_features = ["YearModernization"]

# Method 1 - works fine
pt = PowerTransformer(method="yeo-johnson")
pt.fit(data)
print("This has worked without error")

# Method 2 - warns because n_jobs > 1

power_transformer = Pipeline(steps=[('pow', PowerTransformer(method="yeo-johnson"))])

# Preprocessing the dataframe
preprocessor = ColumnTransformer(
    transformers=
    [   
       ("pow", power_transformer, pow_features)
    ], n_jobs=-1
)
# Pipeline for fitting
preprocessor.fit(data)
print("This produces a warning when n_jobs > 1")

nilslacroix avatar May 10 '22 13:05 nilslacroix

I could make a smaller reproducer with synthetic data that I added in the original discussion.

In addition, if we increase the shift toward positive values, the optimization will always return a log-likelihood of np.inf and thus produce infinite values. Here, using standardize=True hides the extremely large values after the transform.

I think that we should detect this problem and maybe return the original data when we cannot properly transform the dataset. Indeed, it seems that the exponential distribution is not well suited for this transformer.
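One possible shape for such a fallback (purely illustrative, not the eventual fix; the function name is invented): fit as usual, then return the input unchanged whenever the transform produces non-finite values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def transform_or_passthrough(X):
    """Illustrative fallback: return X unchanged if the transform blows up."""
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    with np.errstate(over="ignore"):  # silence the overflow warning locally
        Xt = pt.fit_transform(X)
    if not np.all(np.isfinite(Xt)):
        return X  # cannot transform this distribution properly
    return Xt
```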

glemaitre avatar May 13 '22 15:05 glemaitre

I ran into this as well. Here is a minimal example that reproduces the problem:

import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([2003.0, 1950.0, 1997.0, 2000.0, 2009.0, 2009.0, 1980.0, 1999.0, 2007.0, 1991.0])

# [Issue #1]
# RuntimeWarning: overflow encountered in multiply from `x_trans_var = x_trans.var()`
PowerTransformer(method="yeo-johnson", standardize=True).fit_transform(x[:, np.newaxis])

# [Issue #2]
# RuntimeWarning: overflow encountered in power from `out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda`
PowerTransformer(method="yeo-johnson", standardize=True).fit_transform(x[:5, np.newaxis])

lsorber avatar Mar 11 '23 15:03 lsorber

I think there are two distinct problems to handle here:

  1. Make the expressions that use np.power more robust against overflow.
  2. Avoid overflow in x_trans.var() by regularizing the output's exponent. Without regularization, the output's exponent can blow up for certain inputs for very little gain in the loss function.

Here is a patch that addresses both of the above issues:

from typing import cast

import numpy as np
from scipy import optimize
from sklearn.preprocessing import PowerTransformer as OriginalPowerTransformer


class PowerTransformer(OriginalPowerTransformer):
    """Apply a power transform featurewise to make data more Gaussian-like."""

    def _yeo_johnson_inverse_transform(self, x, lmbda):
        """Return inverse-transformed input x following Yeo-Johnson inverse transform."""
        x_inv = np.zeros_like(x)
        pos = x >= 0

        # when x >= 0
        if abs(lmbda) < np.spacing(1.0):
            x_inv[pos] = np.exp(x[pos]) - 1
        else:  # lmbda != 0
            # x_inv[pos] = np.power(x[pos] * lmbda + 1, 1 / lmbda) - 1  # Original line
            x_inv[pos] = np.exp(np.log1p(x[pos] * lmbda) / lmbda) - 1

        # when x < 0
        if abs(lmbda - 2) > np.spacing(1.0):
            # x_inv[~pos] = 1 - np.power(-(2 - lmbda) * x[~pos] + 1, 1 / (2 - lmbda))  # Original line
            x_inv[~pos] = 1 - np.exp(np.log1p(-(2 - lmbda) * x[~pos]) / (2 - lmbda))
        else:  # lmbda == 2
            x_inv[~pos] = 1 - np.exp(-x[~pos])

        return x_inv

    def _yeo_johnson_transform(self, x, lmbda):
        """Return transformed input x following Yeo-Johnson transform."""
        out = np.zeros_like(x)
        pos = x >= 0  # binary mask

        # when x >= 0
        if abs(lmbda) < np.spacing(1.0):
            out[pos] = np.log1p(x[pos])
        else:  # lmbda != 0
            # out[pos] = (np.power(x[pos] + 1, lmbda) - 1) / lmbda  # Original line
            out[pos] = (np.exp(np.log1p(x[pos]) * lmbda) - 1) / lmbda

        # when x < 0
        if abs(lmbda - 2) > np.spacing(1.0):
            # out[~pos] = -(np.power(-x[~pos] + 1, 2 - lmbda) - 1) / (2 - lmbda)  # Original line
            out[~pos] = -(np.exp(np.log1p(-x[~pos]) * (2 - lmbda)) - 1) / (2 - lmbda)
        else:  # lmbda == 2
            out[~pos] = -np.log1p(-x[~pos])

        return out

    def _yeo_johnson_optimize(self, x):
        """Find and return optimal lambda parameter of the Yeo-Johnson transform."""
        x_tiny = np.finfo(np.float64).tiny

        def _neg_log_likelihood(lmbda):
            """Compute the negative log likelihood of the observed data x."""
            x_trans = self._yeo_johnson_transform(x, lmbda)
            x_trans_var = x_trans.var()

            # Reject transformed data that would raise a RuntimeWarning in np.log
            if x_trans_var < x_tiny:
                return np.inf

            log_var = np.log(x_trans_var)
            # loglike = -n_samples / 2 * log_var  # Original line
            # loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).sum()  # Original line
            loglike = -0.5 * log_var
            loglike += (lmbda - 1) * (np.sign(x) * np.log1p(np.abs(x))).mean()
            loglike -= 0.01 * np.abs(log_var)  # Regularize the output's exponent

            return -loglike

        # the computation of lambda is influenced by NaNs so we need to
        # get rid of them
        x = x[~np.isnan(x)]
        # choosing bracket -2, 2 like for boxcox
        lmbda = cast(float, optimize.brent(_neg_log_likelihood, brack=(-2, 2)))
        return lmbda

lsorber avatar Mar 13 '23 08:03 lsorber

Thanks @lsorber. Would you be interested in opening a PR with non-regression tests based on https://github.com/scikit-learn/scikit-learn/issues/23319#issuecomment-1464933635?

ogrisel avatar Apr 13 '23 08:04 ogrisel

@ogrisel Here's a PR that addresses both types of overflow and includes a non-regression test: https://github.com/scikit-learn/scikit-learn/pull/26188

lsorber avatar Apr 15 '23 16:04 lsorber

Is there an ETA for the fix?

I'm hitting this problem on a completely different dataset. What's worse, I cannot seem to find any way to turn off the warnings in my Jupyter notebook. I've tried warnings, logging, the %%capture directive - this message seems to bypass everything. :( And with cross-validation, the message gets spammed over many screens of text.

This bypass of warning filtering might be a separate bug on its own, I'm not sure yet.

EDIT: X dataframe that triggers the bug (one column needs to be imputed with the mean value):

X_features_numeric.zip

FlorinAndrei avatar Jul 21 '23 19:07 FlorinAndrei

@FlorinAndrei Current status: I created a PR that fixed the issue upstream in SciPy (https://github.com/scipy/scipy/issues/18389), which is nearing completion. After that, I'll create a PR for sklearn that uses SciPy's improved implementation.

lsorber avatar Jul 25 '23 08:07 lsorber

#26308 together with the upcoming scipy 1.12 will resolve this issue.

lorentzenchr avatar Sep 14 '23 09:09 lorentzenchr