Idea: add a `UnstableLabelEncoder`
Related to RareLabelEncoder, I wrote an UnstableLabelEncoder that groups categories that are unstable over time.
You define n_time_buckets (for example 5) and a time_variable. The time_variable is cut into n_time_buckets, and then per variable, per category, I look at the spread (range between min and max) of the normalized value_counts across the buckets. If the spread is at or below the tolerance tol, you can consider the category stable. Probably clearer in code:
X["tmp_time_bucket_id"] = pd.cut(X[self.time_variable], self.n_time_buckets, labels=False)
for var in self.variables_:
    if len(X[var].unique()) > self.n_categories:
        # per time bucket, the % observations per label
        t = X.groupby(["tmp_time_bucket_id"])[var].value_counts(normalize=True)
        # per label, find the spread (max - min) across the time buckets
        t = t.groupby(level=var).agg(np.ptp)
        # stable labels:
        freq_idx = t[t <= self.tol].index
So if the share of any category in a variable varies by more than 5 percentage points (when tol is 0.05) across the time buckets, you can consider it unstable and replace it with replace_with (defaults to "Unstable").
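To make the idea easy to try outside the class, here is a minimal standalone sketch of the same spread check. The function name `find_unstable_labels`, the toy columns `t` and `color`, and the use of `pd.crosstab` are my own choices for illustration, not the code of the proposed class. One deliberate difference: `pd.crosstab` fills in a share of 0 for buckets where a label never occurs, so a label that only appears late gets a large spread.

```python
import numpy as np
import pandas as pd


def find_unstable_labels(X, var, time_variable, n_time_buckets=5, tol=0.05):
    """Return labels of `var` whose share moves by more than `tol` across time buckets."""
    # assign every row to one of n equal-width time buckets
    buckets = pd.cut(X[time_variable], n_time_buckets, labels=False)
    # rows = time buckets, columns = labels, values = share of the bucket's rows
    shares = pd.crosstab(buckets, X[var], normalize="index")
    # spread (max - min) of each label's share; absent labels count as 0
    spread = shares.max(axis=0) - shares.min(axis=0)
    return sorted(spread[spread > tol].index)


# toy data: "green" only shows up in the last fifth of the time range
X = pd.DataFrame({"t": np.arange(100)})
X["color"] = np.where(X["t"] % 2 == 0, "red", "blue")
X.loc[X["t"] >= 80, "color"] = "green"

unstable = find_unstable_labels(X, "color", "t", n_time_buckets=5, tol=0.6)
# replace unstable labels with a generic one, mirroring replace_with
X["color_encoded"] = X["color"].where(~X["color"].isin(unstable), "Unstable")
```

In this toy data the late arrival of "green" also pushes "red" and "blue" from 50% down to 0% in the last bucket, which is why a low tol such as 0.05 would flag all three labels.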
In a machine learning project you would probably set tol to be quite high, like 0.50. This way, if one of a variable's categories starts appearing somewhere in time (or stops appearing), you can avoid using it by throwing it into a generic 'Other' or 'Missing' category (depending on how you set the replace_with parameter). This avoids overfitting to a specific time period, which would lead to an overconfident estimate of model performance.
The method is related to DropHighPSIFeatures: that one removes the entire feature if it's unstable over time, while UnstableLabelEncoder would remove a category in a feature if it's unstable over time.
I've already written the class, so happy to open a PR if you're interested in including it in feature-engine.
I'm still playing with the metric to use for tol. Probably better to use the max absolute percentage difference from the mean. Easier to interpret: a feature category's proportion should not fluctuate more than xx% over time.
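A rough sketch of what that alternative metric could look like, assuming a table of per-bucket shares with time buckets as rows and labels as columns. The helper name `max_relative_deviation` and the numbers in `shares` are made up for illustration.

```python
import numpy as np
import pandas as pd


def max_relative_deviation(shares):
    """Largest absolute deviation from the mean share, relative to that mean, per label."""
    mean_share = shares.mean(axis=0)
    # absolute distance of every bucket's share from the label's mean share
    abs_dev = shares.sub(mean_share, axis=1).abs()
    # a label that occurs at all has mean_share > 0, so the division is safe
    return abs_dev.max(axis=0) / mean_share


# hypothetical shares for three time buckets (each row sums to 1)
shares = pd.DataFrame(
    {
        "red": [0.50, 0.55, 0.45],
        "blue": [0.50, 0.45, 0.25],
        "green": [0.00, 0.00, 0.30],
    }
)
deviation = max_relative_deviation(shares)
```

Here "red" stays within roughly 10% of its mean share, while "green" deviates by about 200% of its mean, so a tol of, say, 0.5 would flag only "green". Unlike the max - min spread, this metric is relative, so rare labels are not automatically treated as stable just because their shares are small.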
Hi @timvink
Thanks for the suggestion!
Did you create this method, or was it described somewhere else? If so, would you be able to add some links for more information?
@gverbock what do you think about this suggestion?
> Did you create this method?

I did.

> Would you be able to add some links for more information?

I don't have them. I would need to spend more time looking for papers, running benchmarks and writing about experimental results.
Does it make sense to close this issue until I have more information?
I would leave it open. And whenever you have the time to gather the information, just pin it here :)
What I find interesting in the discussion is how to deal with unstable categories in a feature. The DropHighPSI approach is designed to work with numeric variables, and the topic of categorical variables is not really addressed. In the current set-up it should be applied after one-hot encoding, so that it removes the unstable encoded features. (@timvink long time no see).
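As a rough illustration of that route, the sketch below one-hot encodes a categorical column and computes a hand-rolled PSI per indicator column between an early and a late period. It does not call DropHighPSIFeatures itself; the helper `indicator_psi`, the toy data, the 50/50 time split, the `eps` floor and the 0.25 cutoff (a common rule of thumb for PSI) are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def indicator_psi(ref, test, eps=1e-4):
    """PSI of a 0/1 column between a reference and a test sample."""
    # distribution over the two values (0 and 1), floored at eps to avoid log(0)
    p_ref = np.clip(np.array([1 - ref.mean(), ref.mean()]), eps, None)
    p_test = np.clip(np.array([1 - test.mean(), test.mean()]), eps, None)
    return float(np.sum((p_test - p_ref) * np.log(p_test / p_ref)))


# toy data: "green" only appears for t >= 80
X = pd.DataFrame({"t": np.arange(100)})
X["color"] = np.where(X["t"] % 2 == 0, "red", "blue")
X.loc[X["t"] >= 80, "color"] = "green"

dummies = pd.get_dummies(X["color"], dtype=float)
is_ref = X["t"] < 50  # first half as reference period, second half as test period

psi = {
    col: indicator_psi(dummies.loc[is_ref, col], dummies.loc[~is_ref, col])
    for col in dummies
}
unstable_columns = [col for col, value in psi.items() if value > 0.25]
```

With this toy data only the "green" indicator exceeds the cutoff, so dropping it has a similar effect to merging "green" into a generic category, except that rows with the dropped label end up as all zeros across the remaining indicators.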
Good suggestion Gilles, I will experiment with OHE + DropHighPSI also. (Indeed long time, nice to run into you here!)