SDV
Automate the categorical encoder selection
Problem Description
Currently the Tabular Models use OneHotEncoder by default for the categorical values in a data set. This creates one new column per unique category value (n new columns for a column with n categories), all of which the model then has to learn, increasing both fitting time and memory usage.
Expected behavior
- Select `categorical` if the number of unique values is too big, use `one_hot_encoder` otherwise.
- Use `one_hot_encoder` for the most frequent categories and `categorical` for the rest.
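
The two options above could be sketched roughly as follows. This is only an illustration in plain Python, not SDV code: the function names `choose_encoder` and `split_by_frequency`, the `top_k` parameter, and the threshold of 10 unique values are all assumptions made for the example, and the encoder names are the ones used in this issue.

```python
from collections import Counter

# Hypothetical threshold: columns with more unique values than this
# fall back to the `categorical` encoder. The value 10 is an assumption.
MAX_ONE_HOT_CATEGORIES = 10


def choose_encoder(values, max_one_hot=MAX_ONE_HOT_CATEGORIES):
    """Option 1: pick a single encoder for a column based on its cardinality."""
    n_unique = len(set(values))
    return "one_hot_encoder" if n_unique <= max_one_hot else "categorical"


def split_by_frequency(values, top_k):
    """Option 2: split categories into the top_k most frequent ones
    (candidates for one-hot encoding) and the remaining ones
    (left to the `categorical` encoder)."""
    counts = Counter(values)
    frequent = [category for category, _ in counts.most_common(top_k)]
    rest = [category for category in counts if category not in frequent]
    return frequent, rest


# Toy columns: one with few categories, one with many.
colors = ["red", "blue", "red", "green", "red", "blue"]
zip_codes = [str(10000 + i) for i in range(50)]

print(choose_encoder(colors))
print(choose_encoder(zip_codes))
print(split_by_frequency(colors, top_k=2))
```

With these toy columns, `colors` has 3 unique values and stays on one-hot encoding, while `zip_codes` has 50 and falls back to `categorical`. A real implementation would need to pick the threshold empirically, since the right cutoff depends on the model and the data set.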
Since this feature request was filed, we have changed the default to categorical fuzzy wherever possible, which greatly improves performance. Users can override this default and switch back to one-hot encoding for particular columns.
We can keep this feature request open because we do not yet have a smart way to decide which categorical encoder to use when. We could possibly add such logic to a future Preset model.