Handling Categorical Features
Hello! Thanks for creating this great open source repo!
I wanted to discuss some basic questions. I'm not the most familiar with this repo or the example datasets so please feel free to provide additional context and help me better understand 🙂
-
For all notebooks that use the UCI Adult dataset, it seems that both the Education and EducationNum features are input to the model. However, these encode the same information, so I was confused why both were used. Also, EducationNum is being interpreted as a continuous variable. Although the categories in this feature seem ordinal, treating it as continuous implies that each increment contributes equally, i.e. that adjacent categories are equally spaced. Should the notebooks only use Education, since that is categorical, which seems to be the proper way to represent this feature?
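To make the redundancy concrete, here's a tiny pandas sketch in the style of the Adult columns (the rows and codes are made up for illustration):

```python
import pandas as pd

# Hypothetical rows in the style of the UCI Adult dataset:
# Education (string) and EducationNum (int) encode the same information.
df = pd.DataFrame({
    "Education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
    "EducationNum": [9, 13, 14, 9],
})

# Each Education label maps to exactly one EducationNum value and vice
# versa, so the two columns are redundant.
mapping = df.drop_duplicates()
assert mapping["Education"].nunique() == mapping["EducationNum"].nunique() == len(mapping)

# Treating EducationNum as continuous also implies the gap between 9
# (HS-grad) and 13 (Bachelors) is four times the gap between 13 and 14
# (Masters), which the ordinal labels don't actually guarantee.
```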
-
As a related question, are the default feature types supposed to be inferred from the input data? For example, if I change the dtype of EducationNum to category, the EBM still interprets it as continuous. To me, this wasn't the expected behavior: LightGBM preserves the category datatype (even without explicitly specifying the categorical columns), but here it seems that only strings are interpreted as categorical? I attached a minimal example in this notebook. 20220125_EBM_categorical_example.ipynb.zip
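For reference, here's a stripped-down version of the pandas side of the attached notebook (column names and values are just illustrative; the EBM fit itself is omitted):

```python
import pandas as pd

# Two encodings of the same feature (values are illustrative).
df = pd.DataFrame({
    "Education": ["HS-grad", "Bachelors", "Masters"],
    "EducationNum": [9, 13, 14],
})
df["EducationNum"] = df["EducationNum"].astype("category")

# The cast itself clearly sticks on the pandas side...
assert isinstance(df["EducationNum"].dtype, pd.CategoricalDtype)
# ...yet after fitting, the EBM still treats EducationNum as continuous,
# which is why it looks like only string-valued columns are picked up
# as categorical.
```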
Thanks for your help!
Hi @kspieks,
Good questions! Here are some answers:
-
Yes, in practice you should drop one version of this feature. We deliberately leave both in as an illustrative example of how the model handles different encodings when users explore the model for the first time, but it's certainly not best data science practice to keep both versions.
-
We try to do our best at inferring data types -- under the hood, we use Pandas APIs like https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.infer_objects.html to auto-guess what type we think a column should be. However, to have full control over types, we recommend passing in your type definition via the `feature_types` parameter when initializing the ExplainableBoostingClassifier:
```python
from interpret.glassbox import ExplainableBoostingClassifier

# If your input data has 4 features...
ebm = ExplainableBoostingClassifier(
    feature_types=["continuous", "categorical", "categorical", "continuous"]
)
```
Thanks for pointing out the issue with auto-inference of the category dtype -- that looks like a bug!
-InterpretML Team
Thanks for your replies!
-
I see. Thanks for the clarification.
-
Yeah, `feature_types` seemed to be the only way to resolve it, which was unexpected. What's interesting is that the `infer_objects` method works correctly on its own, i.e. casting the dtype to category and then calling `df.infer_objects().dtypes` correctly shows that the dtype is category. Yet the EBM interprets it as continuous, so perhaps the issue is caused by something else in the code?
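A minimal pandas-only sketch of what I mean (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"EducationNum": [9, 13, 14, 9]})
df["EducationNum"] = df["EducationNum"].astype("category")

# infer_objects() only soft-converts object-dtype columns, so an explicit
# category dtype survives the call unchanged.
inferred = df.infer_objects().dtypes
assert str(inferred["EducationNum"]) == "category"
```

So the Pandas inference step reports category correctly; the continuous interpretation must come from somewhere downstream.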
Hi @kspieks -- This preprocessing code has been completely replaced, and Pandas is no longer used to infer feature types, so it's very likely that older bugs in this area have been resolved.