
Automate the categorical encoder selection

Open • pvk-developer opened this issue 4 years ago • 1 comment

Problem Description

Currently, the Tabular Models use OneHotEncoder by default for the categorical values in a dataset. For a column with n unique values, this creates n new columns that the model then has to learn, which increases fitting time and memory usage.
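To illustrate the column growth, here is a minimal pure-Python sketch of one-hot encoding. The `one_hot` helper is hypothetical and written only for this example; it is not the encoder SDV uses internally.

```python
def one_hot(values):
    """One-hot encode a list of categorical values (illustrative helper only)."""
    # One output column per distinct category, in sorted order.
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


rows, categories = one_hot(["red", "green", "blue", "green"])
# A single input column with 3 unique values becomes 3 output columns.
print(len(categories), len(rows[0]))
```

A column with thousands of unique values would produce thousands of such columns, which is the cost described above.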

Expected behavior

  • Select categorical if the number of unique values is too large; use one_hot_encoder otherwise.
  • Use one_hot_encoder for the most frequent categories and categorical for the rest.
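The two options above could be sketched roughly as follows. This is only an illustration of the proposed heuristics, not an existing SDV feature: the function names, the `max_one_hot_categories` threshold, and the `top_k` cutoff are all assumptions made for the example.

```python
from collections import Counter


def choose_encoder(values, max_one_hot_categories=10):
    """Option 1 (hypothetical): pick one encoder per column by cardinality."""
    n_unique = len(set(values))
    # The threshold is an assumed, tunable value, not an SDV default.
    if n_unique <= max_one_hot_categories:
        return "one_hot_encoder"
    return "categorical"


def split_by_frequency(values, top_k=5):
    """Option 2 (hypothetical): one-hot the top_k categories, categorical for the rest."""
    counts = Counter(values)
    frequent = [category for category, _ in counts.most_common(top_k)]
    frequent_set = set(frequent)
    rare = [category for category in counts if category not in frequent_set]
    return frequent, rare


column = ["a"] * 5 + ["b"] * 3 + ["c"]
print(choose_encoder(column))                 # low cardinality column
print(choose_encoder(list(range(100))))       # high cardinality column
print(split_by_frequency(column, top_k=2))
```

How to choose the threshold or the cutoff is the open question here; a fixed number is the simplest choice, but it could also depend on the number of rows or on available memory.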

pvk-developer, Mar 15 '21 11:03

Since this feature request was filed, we have changed the default to categorical fuzzy whenever we can, which greatly improves performance. Users can override this for particular columns, for example to change back to one-hot encoding.

We can keep this feature request open because we do not yet have a smart way to decide which categorical encoder to use when. We could possibly add such logic to a future Preset model.

npatki, Jul 07 '22 20:07