feature_engine icon indicating copy to clipboard operation
feature_engine copied to clipboard

Fix repeated features for OneHotEncoder and PolynomialFeatures

Open datacubeR opened this issue 3 years ago • 7 comments

Hi @solegalli, This is my shot for fixing #489. After checking the code in detail I think the issue affects not only PolynomialFeatures but also Sklearn OneHotEncoder. When using SklearnTransformerWrapper + OneHotEncoder I get this:

image

Which is totally unexpected, since the idea is to replace Categorical Features for OneHotEncoded ones.

After the fix I'm proposing I get this for OneHotEncoder image

Which I think is the expected behavior.

For PolynomialFeatures I get the following: image

Please note that an OrdinalEncoder was applied to Categorical Variables before applying PolynomialFeatures, otherwise it throws an error since they are not numerical features.

I think these cases could be added as test cases if you think it's OK. But before I would love to get your feedback.

Best,

Alfonso

datacubeR avatar Aug 04 '22 03:08 datacubeR

Hi @datacubeR

Thanks a lot for the changes.

I do agree that the idea is to replace the original categorical variables by the one hot encoded ones. So I guess, it is OK to have them removed from the dataset as you did.

Originally I thought, if that is the intended functionality, then why not use the OHE from feature engine instead of the one from sklearn? that is why I decided not to drop them. But I think it makes sense to go ahead with your suggestion.

I think it would be great to have those code snippets as tests, and also maybe in the user guide: https://feature-engine.readthedocs.io/en/latest/user_guide/wrappers/Wrapper.html#sklearn-wrapper (I mean examples of wrapping the OHE and the PolynomialFeatures)

would you be able to do that?

thanks a lot!

solegalli avatar Aug 06 '22 14:08 solegalli

PS: I fixed the failing tests in the main branch last week. It was something related to the boxcox with the latest version of scipy. If you sync your main branch and then rebase it to this feature branch, it should get those sorted @datacubeR

thank you!

solegalli avatar Aug 06 '22 14:08 solegalli

@solegalli I think my fork is up to date, but the failing tests is because they considering adding Features rather than replacing the existing feature for the transformed ones. I will fix those and I'll get back to you.

datacubeR avatar Aug 06 '22 20:08 datacubeR

@solegalli do I need to create a separate PR for the examples in the User Guide or I just include those in this one?

datacubeR avatar Aug 07 '22 01:08 datacubeR

parate PR for the exampl

Here would be good :)

Thank you!

solegalli avatar Aug 07 '22 08:08 solegalli

It's looking good.

I think we need to add a test to corroborate that the method get_feature_names_out returns the original features, without the ones used in the transformation, plus the new ones, when it is called without passing the input_feature parameters.

so tr_wrap.get_feature_names_out() should return = feature_names_in - self.variables_ + new features

Could you add a test for that please? Or is it already there? I mean just for the Poly and OHE

solegalli avatar Aug 07 '22 08:08 solegalli

Hey @datacubeR

This is almost ready. Some small changes here and there. Would you be able to have a look?

Thank you!

solegalli avatar Aug 13 '22 12:08 solegalli

Hi @datacubeR

Thank you for looking into the get_feature_names_out() issue. I think we can ignore that request then.

Let me know when this is good to go.

Thanks a lot

solegalli avatar Aug 18 '22 09:08 solegalli

Hi @solegalli, sorry for the delay. I updated my OS and had to reinstall lot of things. I just pushed all the last observations for your review. Please let me know if everything looks ok.

Best,

Alfonso

datacubeR avatar Aug 25 '22 01:08 datacubeR

Thank you @datacubeR !!

Great fix.

solegalli avatar Aug 30 '22 09:08 solegalli