
PolynomialFeatures + SklearnTransformerWrapper weird behavior

Open datacubeR opened this issue 3 years ago • 3 comments

Describe the bug

When using PolynomialFeatures with SklearnTransformerWrapper, the base features are duplicated. When trying to deduplicate them with DropDuplicateFeatures, the values are repeated again!!

Using the simple Titanic Dataset you can run something like this:

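import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from feature_engine.encoding import OrdinalEncoder
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.selection import DropDuplicateFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')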
X = df[['pclass','sex','age','fare','embarked']]
y = df.survived

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline(steps = [
    ('ci', CategoricalImputer(imputation_method='frequent')),
    ('mmi', MeanMedianImputer(imputation_method='mean')),
    ('od', OrdinalEncoder(encoding_method='arbitrary')),
    ('pl', SklearnTransformerWrapper(PolynomialFeatures(degree = 2, interaction_only = True, include_bias=False), variables=['pclass','sex'])),
    #('drop', DropDuplicateFeatures()),
    #('sc', SklearnTransformerWrapper(StandardScaler(), variables=['Age','Fare'])),
    #('lr', LogisticRegression(random_state=42))

])
pipe.fit_transform(X_train)

This returns the first issue:

image

I'm getting pclass and sex duplicated, but I expected to get back only the interactions. According to the sklearn docs this is expected from PolynomialFeatures, but why would I want duplicated features?

Looking into the Feature Engine docs I found DropDuplicateFeatures(), but when I add it to the Pipeline I get this:

df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
X = df[['pclass','sex','age','fare','embarked']]
y = df.survived

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline(steps = [
    ('ci', CategoricalImputer(imputation_method='frequent')),
    ('mmi', MeanMedianImputer(imputation_method='mean')),
    ('od', OrdinalEncoder(encoding_method='arbitrary')),
    ('pl', SklearnTransformerWrapper(PolynomialFeatures(degree = 2, interaction_only = True, include_bias=False), variables=['pclass','sex'])),
    ('drop', DropDuplicateFeatures()),
    #('sc', SklearnTransformerWrapper(StandardScaler(), variables=['Age','Fare'])),
    #('lr', LogisticRegression(random_state=42))

])
pipe.fit_transform(X_train)

image

I get tons of repeated features, which is totally unexpected.

Expected behavior Not getting repeated/duplicated features.

Screenshots Shown above.

Desktop:

  • Ubuntu 20.04
  • Feature Engine 1.4.0

Thanks in Advance,

Alfonso

datacubeR avatar Jul 21 '22 19:07 datacubeR

Hi @datacubeR

Thank you for raising the issue. Apologies for the delay. I was on holidays until yesterday.

To check if I understand this correctly, the first pipeline should return these columns: ['age', 'fare', 'embarked', 'pclass', 'sex', 'pclass sex']

but it is instead returning these columns: ['pclass', 'sex', 'age', 'fare', 'embarked', 'pclass', 'sex', 'pclass sex']

DropDuplicateFeatures() would not identify the duplicated variables, because the original inputs are integers and the ones returned by PolynomialFeatures are floats (it evaluates the variable values and not the variable names). The funny output is then returned by pandas because there are duplicated variable names, yet they have different values. So in this case DropDuplicateFeatures is not to blame, it is pandas ;)
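To illustrate the pandas side of this, here is a minimal sketch with made-up toy values (not the Titanic data). It assumes the wrapper output is equivalent to concatenating the original integer columns with the float output of PolynomialFeatures, and it shows only how pandas handles the repeated names, not how DropDuplicateFeatures works internally.

```python
import pandas as pd

# Toy stand-ins: integer inputs, and a float copy as PolynomialFeatures would return.
original = pd.DataFrame({"pclass": [1, 3, 2], "sex": [0, 1, 1]})
poly_out = pd.DataFrame(
    {
        "pclass": [1.0, 3.0, 2.0],
        "sex": [0.0, 1.0, 1.0],
        "pclass sex": [0.0, 3.0, 2.0],
    }
)

# Column-wise concatenation keeps both copies of 'pclass' and 'sex'.
combined = pd.concat([original, poly_out], axis=1)

# Selecting a repeated name returns a DataFrame with every matching column,
# not a single Series. Any code that indexes by name will multiply columns.
selected = combined["pclass"]

# Deduplicating by *name* keeps the first occurrence of each column.
dup_mask = combined.columns.duplicated()
deduped = combined.loc[:, ~dup_mask]
```

Because `combined["pclass"]` is a two-column DataFrame (one integer column, one float column), any step that selects columns by name after the wrapper can end up repeating them.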

This is a fairly straightforward fix. We need to remove the duplicated variables in the transform method, probably here: https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/wrappers/wrappers.py#L281-L285

The duplicated variables are, in essence, the input features.

The slightly more complicated part is adjusting the output of the method get_feature_names_out() to display the right features.

We need to separate this block of code into two:

https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/wrappers/wrappers.py#L369-L378

The first one, as is, would be valid only for the OneHotEncoder. The second one needs to be adapted to correctly display the feature names when wrapping PolynomialFeatures.

Would you like to have a go at fixing the class?

solegalli avatar Aug 02 '22 13:08 solegalli

Hi @solegalli, I would love to give it a try. What I would like to understand better is that you seem to think the problem is the SklearnTransformerWrapper, not PolynomialFeatures or DropDuplicateFeatures. On that note, do you think it is expected behavior not to detect duplicate features when the only difference is the data type?

Thanks and I'll keep you posted!!

datacubeR avatar Aug 03 '22 16:08 datacubeR

Yes. The PolynomialFeatures from sklearn is designed to just return the polynomial features. It operates over the entire dataset. As such, the result will include the original features, which are the features raised to the power of 1.

The SklearnTransformerWrapper applies PolynomialFeatures to a selected group of variables and then appends the result to the original data, creating the duplication. Thus, we need to fix this issue.

solegalli avatar Aug 03 '22 16:08 solegalli