added option to use sklearn's OneHotEncoder to handle unknown categories
This library is amazing. I noticed a small issue when using Multiple Correspondence Analysis: because the function uses pd.get_dummies internally to one-hot encode the data, I got an error when some categorical features in my test set contained categories that were not present in the training set.
Therefore, I have added a OneHotEncoder object from sklearn.preprocessing to process the data when the user opts out of the get_dummies function.
These are the three attributes that I have specified:
- get_dummies: if True, uses the original get_dummies method (default is False)
- one_hot_encoder: the OneHotEncoder object
- is_one_hot_fitted: boolean indicating whether the one_hot_encoder is fitted
I have updated the _prepare function as well:
def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if self.is_one_hot_fitted == False:
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                X_enc = self.one_hot_encoder.transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                return X_enc
    return X
Let me know if there is anything else I can do, or whether the workings are correct.
Thanks again for this great library <3
Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.
And thanks for the appreciation :)
Hi, thank you. I didn't try the unit tests, and as you said, the unit tests are failing. Please let me know if there is anything that I can do, and also, may I know the reason for having supplementary columns?
I modified the mca file to handle unknown features. The unit test error occurs because the features seen during fit differ from the features passed when transforming, so I modified the _prepare function in mca.py:
def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if self.is_one_hot_fitted == False:
                # if the one_hot_encoder is not fitted, fit it and set is_one_hot_fitted to True
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # check whether the columns fed to the one-hot encoder match the columns it was fitted on
                oh_cols = set(self.one_hot_encoder.feature_names_in_.tolist())
                X_cols = set(X.columns.tolist())
                if oh_cols == X_cols:
                    # the fitted columns match the inference columns, so we can transform
                    X_enc = self.one_hot_encoder.transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
                else:
                    # the fitted columns differ from the inference columns, so fit the encoder again (to handle the unit tests)
                    print(X_cols)
                    print(oh_cols)
                    X_enc = self.one_hot_encoder.fit_transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
    return X
I ran the unit tests and didn't have any issues on my side. Please let me know if this works.
Ok, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.
Sure, thank you. I saw the error in the clean code test and made a change.
Hi @MaxHalford, is there any update to this?
Hey @Vaseekaran-V! I finally carved out some time to look into this. Turns out I found a simpler solution in #181