PolynomialFeatures + SklearnTransformerWrapper weird behavior
Describe the bug
When using PolynomialFeatures inside SklearnTransformerWrapper, the base features are duplicated, and when trying to dedup them with DropDuplicateFeatures, the features are repeated yet again!
Using the simple Titanic dataset, you can run something like this:
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from feature_engine.encoding import OrdinalEncoder
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.wrappers import SklearnTransformerWrapper

df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
X = df[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = df.survived
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ('ci', CategoricalImputer(imputation_method='frequent')),
    ('mmi', MeanMedianImputer(imputation_method='mean')),
    ('od', OrdinalEncoder(encoding_method='arbitrary')),
    ('pl', SklearnTransformerWrapper(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        variables=['pclass', 'sex'])),
    # ('drop', DropDuplicateFeatures()),
    # ('sc', SklearnTransformerWrapper(StandardScaler(), variables=['Age', 'Fare'])),
    # ('lr', LogisticRegression(random_state=42))
])
pipe.fit_transform(X_train)
```
This returns the first issue:

I'm getting pclass and sex duplicated, while I'm expecting to get back only the interactions. This is expected per the sklearn docs, but why would I want duplicated features?
Looking into the Feature-engine docs I found DropDuplicateFeatures(), but when adding it to the pipeline I get this:
```python
from feature_engine.selection import DropDuplicateFeatures

df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
X = df[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = df.survived
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline(steps=[
    ('ci', CategoricalImputer(imputation_method='frequent')),
    ('mmi', MeanMedianImputer(imputation_method='mean')),
    ('od', OrdinalEncoder(encoding_method='arbitrary')),
    ('pl', SklearnTransformerWrapper(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        variables=['pclass', 'sex'])),
    ('drop', DropDuplicateFeatures()),
    # ('sc', SklearnTransformerWrapper(StandardScaler(), variables=['Age', 'Fare'])),
    # ('lr', LogisticRegression(random_state=42))
])
pipe.fit_transform(X_train)
```

I get tons of repeated features, which is totally unexpected.
Expected behavior No repeated or duplicated features.
Screenshots Shown above.
Desktop (please complete the following information):
- Ubuntu 20.04
- Feature Engine 1.4.0
Thanks in advance,
Alfonso
Hi @datacubeR
Thank you for raising the issue. Apologies for the delay. I was on holiday until yesterday.
To check that I understand this correctly: the first pipeline should return these columns: ['age', 'fare', 'embarked', 'pclass', 'sex', 'pclass sex']
but it is instead returning these columns: ['pclass', 'sex', 'age', 'fare', 'embarked', 'pclass', 'sex', 'pclass sex']
DropDuplicateFeatures() does not identify the duplicated variables because the original inputs are integers while the columns returned by PolynomialFeatures are floats (it compares the variable values, not the variable names). The funny output is then produced by pandas, because there are duplicated variable names that hold different values. So in this case DropDuplicateFeatures is not to blame, it is pandas ;)
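Both points can be illustrated with plain pandas (a small sketch, assuming the value-based comparison works like `Series.equals`, which is dtype-aware):

```python
import pandas as pd

# The original column is int64; the one PolynomialFeatures returns is float64.
ints = pd.Series([1, 2, 3], name='pclass')
floats = pd.Series([1.0, 2.0, 3.0], name='pclass')

# A dtype-aware value comparison does not consider them equal:
print(ints.equals(floats))  # False

# And pandas happily keeps two columns under the same name:
df = pd.concat([ints, floats], axis=1)
print(df['pclass'].shape)  # (3, 2) -> selecting the name returns both columns
```

This is why a value-based duplicate check passes over the pair, yet any later name-based column selection touches both copies at once.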
This is a fairly straightforward fix. We need to remove the duplicated variables in the transform method, probably here: https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/wrappers/wrappers.py#L281-L285
The duplicated variables are, in essence, the input features.
And the slightly more complicated part is adjusting the output of the get_feature_names_out() method to display the right features.
We need to split this block of code in two:
https://github.com/feature-engine/feature_engine/blob/60a70455465be305e9b9e1b3fc17dfbcd1ca2ae8/feature_engine/wrappers/wrappers.py#L369-L378
keeping the first part, as is, valid only for the OneHotEncoder, and adding a second part that correctly displays the feature names when wrapping PolynomialFeatures.
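A hypothetical sketch of that second part (the function name and signature are illustrative, not the wrapper's actual API), consistent with the expected column order above:

```python
def poly_feature_names_out(input_features, variables, poly_names):
    """input_features: all columns seen at fit time.
    variables: the columns the wrapped PolynomialFeatures was applied to.
    poly_names: names the fitted PolynomialFeatures reports (these
    already include the degree-1 inputs, so they must not be repeated)."""
    rest = [name for name in input_features if name not in variables]
    return rest + list(poly_names)

names = poly_feature_names_out(
    input_features=['pclass', 'sex', 'age', 'fare', 'embarked'],
    variables=['pclass', 'sex'],
    poly_names=['pclass', 'sex', 'pclass sex'],
)
print(names)  # ['age', 'fare', 'embarked', 'pclass', 'sex', 'pclass sex']
```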
Would you like to give it a go at fixing the class?
Hi @solegalli,
I would love to give it a try. One thing I would like to understand better: it seems you think the problem is in the SklearnTransformerWrapper, not in PolynomialFeatures nor DropDuplicateFeatures.
On that note, do you think it is expected behavior for DropDuplicateFeatures not to detect duplicate features when the only difference is the data type?
Thanks, and I'll keep you posted!
Yes. PolynomialFeatures from sklearn is designed to just return the polynomial features, and it operates over the entire dataset. As such, the result contains the original features, which are the features raised to the power of 1.
The SklearnTransformerWrapper applies PolynomialFeatures to a selected group of variables and then appends the result to the original data, creating the duplication. Thus, we need to fix this issue.