
Handling Categorical Features

Open kspieks opened this issue 4 years ago • 2 comments

Hello! Thanks for creating this great open source repo!

I wanted to discuss some basic questions. I'm not the most familiar with this repo or the example datasets so please feel free to provide additional context and help me better understand 🙂

  1. For all notebooks that use the UCI Adult dataset, it seems that both the Education and EducationNum features are input to the model. However, these are the same feature, so I was confused why both were used. Also, EducationNum is being interpreted as a continuous variable. Although the categories in this feature seem ordinal, using it as continuous implies that each increment has equal contribution, and the distance between each category is identical. Should the notebooks only use Education since that is categorical, which seems to be the proper way to represent this feature?

  2. As a related question, are the default feature types supposed to be inferred from the input data? For example, if I change the dtype of EducationNum to category, the EBM still interprets it as continuous. This wasn't the behavior I expected: LightGBM preserves the category dtype (even without explicitly specifying the categorical columns), but here it seems that only strings are interpreted as categorical? I attached a minimal example in this notebook: 20220125_EBM_categorical_example.ipynb.zip

Thanks for your help!

kspieks, Jan 25 '22 15:01

Hi @kspieks,

Good questions! Here are some answers:

  1. Yes, in practice you should drop one version of this feature. We leave both in to demonstrate how the model handles different encodings of the same feature for users exploring it for the first time, but keeping both versions is certainly not the best data science practice.

  2. We try to do our best at inferring data types -- under the hood, we use Pandas APIs like https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.infer_objects.html to auto-guess what type we think a column should be. However, to have full control over types, we recommend passing your type definitions to the feature_types parameter when initializing the ExplainableBoostingClassifier:

```python
from interpret.glassbox import ExplainableBoostingClassifier

# If your input data has 4 features...
ebm = ExplainableBoostingClassifier(
    feature_types=["continuous", "categorical", "categorical", "continuous"]
)
```

Thanks for pointing out the issue with the auto infer on the category dtype -- that looks like a bug!
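Because feature_types is a positional list, it is easy to mislabel a column when the column order changes. A minimal sketch of one way to guard against that, assuming a hypothetical helper named build_feature_types (not part of interpret) and an illustrative subset of column names; the label strings are the ones used in the reply above, and the set of accepted labels may differ between interpret versions:

```python
from typing import Iterable, List, Mapping


def build_feature_types(
    columns: Iterable[str],
    overrides: Mapping[str, str],
    default: str = "continuous",
) -> List[str]:
    """Return one type label per column, in column order.

    Hypothetical helper for illustration; it is not part of interpret.
    """
    columns = list(columns)
    # Fail loudly if an override names a column that is not in the data.
    unknown = set(overrides) - set(columns)
    if unknown:
        raise KeyError(f"overrides name unknown columns: {sorted(unknown)}")
    # Columns without an override fall back to the default label.
    return [overrides.get(name, default) for name in columns]


# Illustrative column names only; use the columns of your own data.
columns = ["Age", "Education", "EducationNum", "HoursPerWeek"]
feature_types = build_feature_types(
    columns,
    {"Education": "categorical", "EducationNum": "categorical"},
)
print(feature_types)

# The list can then be passed as in the reply above:
# ebm = ExplainableBoostingClassifier(feature_types=feature_types)
```

Keying the overrides by column name keeps the labels aligned with the data even if columns are reordered or dropped before training.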

-InterpretML Team

interpret-ml, Jan 26 '22 06:01

Thanks for your replies!

  1. I see. Thanks for the clarification.

  2. Yeah, feature_types seemed to be the only way to resolve it, which was unexpected. What's interesting is that the infer_objects method works correctly on its own, i.e. casting the dtype to category and then calling df.infer_objects().dtypes correctly shows that the dtype is category. Yet the EBM interprets the column as continuous, so perhaps the issue is caused by something else in the code?
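The pandas-only check described above can be reproduced with a small sketch. The values below are made up for illustration and are not taken from the attached notebook; only the column names come from the thread. It touches pandas alone and does not exercise the EBM's own type handling:

```python
import pandas as pd

# Made-up rows for illustration; not the actual UCI Adult data.
df = pd.DataFrame(
    {
        "EducationNum": [9, 13, 10, 13],
        "Age": [39, 50, 38, 53],
    }
)

# Cast the integer-coded education level to pandas' category dtype.
df["EducationNum"] = df["EducationNum"].astype("category")

# infer_objects only attempts to convert object-dtype columns,
# so the category column should pass through unchanged.
inferred = df.infer_objects()
education_is_category = isinstance(
    inferred["EducationNum"].dtype, pd.CategoricalDtype
)

print(inferred.dtypes)
print(education_is_category)
```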

kspieks, Jan 26 '22 14:01

Hi @kspieks -- This preprocessing code has been completely replaced, and Pandas is no longer used to infer feature types, so it's very likely that older bugs in this area have been resolved.

paulbkoch, Feb 13 '23 03:02