
Support for resolving the high-cardinality categorical problem

Open dylanw-oss opened this issue 2 years ago • 6 comments

When doing a machine learning project, an important step is preprocessing the data. A common problem is "high cardinality": a categorical column has too many unique values. One solution is to lump the categorical column to its top k values and assign all other values to an "other" category.

Describe the solution you'd like
SynapseML should provide a transformer I can use to do this work easily, for example:

lumpModel = (LumpFeatureTransformer()
               .setLumpRules({"Color": 5, "Education": 3, "Region": 8})
             ).fit(data)
df = lumpModel.transform(data)

The returned data frame will be processed according to the following rules:

  1. "Color" column: keep the top 5 values and assign all other color values to "other"
  2. "Education" column: keep the top 3 values and assign all other education values to "other"
  3. "Region" column: keep the top 8 values and assign all other region values to "other"
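The proposed `LumpFeatureTransformer` does not exist yet; as a rough illustration of the intended fit/transform semantics, here is a minimal pure-Python sketch of the lumping logic (the function names and the dict-of-rows representation are hypothetical, chosen only to keep the example self-contained):

```python
from collections import Counter

def fit_lump_rules(rows, rules):
    """For each column in `rules` (column name -> k), find the
    top-k most frequent values. Returns column -> set of kept values."""
    keep = {}
    for col, k in rules.items():
        counts = Counter(row[col] for row in rows)
        keep[col] = {value for value, _ in counts.most_common(k)}
    return keep

def lump_transform(rows, keep, other="other"):
    """Replace any value not in the kept set with the `other` label."""
    out = []
    for row in rows:
        new_row = dict(row)
        for col, kept in keep.items():
            if new_row[col] not in kept:
                new_row[col] = other
        out.append(new_row)
    return out

# Toy data: two rare colors ("teal", "mauve") get lumped into "other".
data = [{"Color": c} for c in
        ["red"] * 5 + ["blue"] * 3 + ["green"] * 2 + ["teal", "mauve"]]
keep = fit_lump_rules(data, {"Color": 3})
lumped = lump_transform(data, keep)
```

A real SynapseML transformer would implement the same two phases on Spark DataFrames, computing the top-k sets in `fit` and applying the replacement in `transform`.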

Additional context
I didn't find this function in any other library; happy to close this feature request if it is available somewhere.

dylanw-oss avatar Mar 25 '23 00:03 dylanw-oss

Hey @dylanw-oss :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot] avatar Mar 25 '23 00:03 github-actions[bot]

asking mhamilton

svotaw avatar Mar 28 '23 17:03 svotaw

Our data scientists are asking for this feature and it would be very helpful to them. Please let me know if it makes sense; I'll work on it.

dylanw-oss avatar Mar 28 '23 20:03 dylanw-oss

@dylanw-oss Not sure what the exact scenario is, but one way to deal with a high-cardinality, skewed categorical variable is to encode the feature as-is and use regularization when fitting the estimator.
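To make the regularization suggestion concrete, here is a small NumPy sketch (entirely hypothetical data, not from the thread): a skewed categorical column is one-hot encoded as-is, and ridge regression's L2 penalty shrinks the coefficients of rare levels, which have little supporting data, toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_levels, n_rows = 40, 400

# Skewed categorical draw: low indices (common levels) dominate the sample.
idx = np.minimum(rng.geometric(0.3, size=n_rows) - 1, n_levels - 1)

# One-hot encode the high-cardinality column as-is (no lumping).
X = np.zeros((n_rows, n_levels))
X[np.arange(n_rows), idx] = 1.0

# Target depends only on a few common levels, plus noise.
true_w = np.zeros(n_levels)
true_w[:5] = [2.0, -1.0, 1.5, 0.5, -0.5]
y = X @ true_w + rng.normal(0.0, 0.1, n_rows)

# Ridge regression in closed form: the L2 penalty `lam` keeps
# coefficients of rarely-seen (or unseen) levels near zero.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(n_levels), X.T @ y)
```

Because the one-hot columns are orthogonal here, each coefficient is simply the level's mean response shrunk by a factor of count / (count + lam), which shows directly why rare levels contribute little.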

memoryz avatar Apr 06 '23 07:04 memoryz

I think regularization applies when doing model.fit, but if the input dataset has a few high-cardinality, skewed categorical columns, won't that affect the encoding step itself (making it very slow, and is it even necessary to treat every value as a meaningful category)? Sarah already used this pattern in the SRM notebook; basically, this request is to extract it into a library.

@sarahshy to add input

dylanw-oss avatar Apr 26 '23 02:04 dylanw-oss

A good example is zip codes. The column may be high cardinality, but most observations are concentrated in a few zip codes. Binning/lumping is probably the most straightforward solution here, and it lets users specify how many categories to keep.

sarahshy avatar Apr 26 '23 19:04 sarahshy