added option to use sklearn's OneHotEncoder to handle unknown categories

Open Vaseekaran-V opened this issue 1 year ago • 7 comments

This library is amazing. I noticed a small issue when using Multiple Correspondence Analysis: because MCA calls pd.get_dummies internally to one-hot encode the data, I got an error when my test set contained categories, in certain categorical features, that were not present in the training set.

To handle this, I have added a OneHotEncoder object from sklearn.preprocessing that encodes the data when the user opts out of the get_dummies function.

These are the three attributes that I have added:

  • get_dummies: if True, the original pd.get_dummies method is used (default: False)
  • one_hot_encoder: the OneHotEncoder object
  • is_one_hot_fitted: boolean that records whether one_hot_encoder has been fitted

I have updated the _prepare function as well:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            # Original behavior: the columns depend on the categories present in X
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # First call: fit the encoder and remember that it is fitted
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # Later calls: reuse the categories learned during fit
                X_enc = self.one_hot_encoder.transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                return X_enc
    return X

Let me know if there is anything else I can do, or whether the workings are correct.

Thanks again for this great library <3

Vaseekaran-V avatar Sep 03 '24 15:09 Vaseekaran-V

Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.

MaxHalford avatar Sep 07 '24 18:09 MaxHalford

And thanks for the appreciation :)

MaxHalford avatar Sep 07 '24 18:09 MaxHalford

Hi, thank you. I hadn't run the unit tests, and as you said, they are failing. Please let me know if there is anything that I can do. Also, may I know the reason for having supplementary columns?

Vaseekaran-V avatar Sep 08 '24 07:09 Vaseekaran-V

I modified the mca file to handle unknown features. The unit tests fail because the columns seen during fit differ from the columns passed when transforming, so I modified the _prepare function in mca.py:

def _prepare(self, X):
    if self.one_hot:
        if self.get_dummies:
            X = pd.get_dummies(X, columns=X.columns)
            return X
        else:
            if not self.is_one_hot_fitted:
                # The encoder is not fitted yet: fit it and set is_one_hot_fitted to True
                X_enc = self.one_hot_encoder.fit_transform(X)
                X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                self.is_one_hot_fitted = True
                return X_enc
            else:
                # Check whether the incoming columns match the columns the encoder was fitted on
                oh_cols = set(self.one_hot_encoder.feature_names_in_.tolist())
                X_cols = set(X.columns.tolist())

                if oh_cols == X_cols:
                    # Same columns as at fit time: transform with the fitted encoder
                    X_enc = self.one_hot_encoder.transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
                else:
                    # Different columns than at fit time: refit the encoder so the unit tests pass
                    print(X_cols)
                    print(oh_cols)
                    X_enc = self.one_hot_encoder.fit_transform(X)
                    X_enc = pd.DataFrame(X_enc, columns=self.one_hot_encoder.get_feature_names_out(X.columns))
                    return X_enc
    return X

I ran the unit tests and didn't have issues on my side. Please let me know if this works.

Vaseekaran-V avatar Sep 08 '24 07:09 Vaseekaran-V

Ok, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.

MaxHalford avatar Sep 08 '24 15:09 MaxHalford

Sure, thank you. I saw the error in the clean code test and made a change.

Vaseekaran-V avatar Sep 08 '24 16:09 Vaseekaran-V

Hi @MaxHalford, is there any update to this?

Vaseekaran-V avatar Sep 22 '24 14:09 Vaseekaran-V

Hey @Vaseekaran-V! I finally carved out some time to look into this. Turns out I found a simpler solution in #181

MaxHalford avatar Nov 17 '24 22:11 MaxHalford