metadata for `categorical_cols` should be determined by user specification

Open hjwilli opened this issue 5 years ago • 1 comments

The methods that generate the metafeatures currently determine which columns are categorical by examining the data. For example, if a file contains some strings in a column, that column shows up in _categorical_cols. However, if an uploaded file has only integers in a column, that column does not show up in _categorical_cols, even if the user specified it as categorical.

Update the metadata so that _categorical_cols uses the columns specified by the user instead of the calculation.

Note that this only affects how data is encoded when calculating metafeatures. When running experiments, user preferences are honored.
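
As a rough illustration of the requested behavior, here is a minimal sketch in plain Python. The function names (`infer_categorical_cols`, `resolve_categorical_cols`) and the `user_categorical_cols` parameter are hypothetical and are not taken from dataset_describe.py; the string-based inference only approximates the current data-driven check described above.

```python
def infer_categorical_cols(rows):
    """Data-driven guess: treat a column as categorical if any value is a string."""
    cols = list(rows[0].keys()) if rows else []
    return [c for c in cols if any(isinstance(r[c], str) for r in rows)]


def resolve_categorical_cols(rows, user_categorical_cols=None):
    """Prefer the user's specification; fall back to inference only if none is given."""
    if user_categorical_cols is not None:
        return list(user_categorical_cols)
    return infer_categorical_cols(rows)


rows = [
    {"color": "red", "zip_code": 19104, "height": 1.7},
    {"color": "blue", "zip_code": 10001, "height": 1.8},
]

# Inference misses the integer-coded column.
inferred = resolve_categorical_cols(rows)

# With a user specification, the integer-coded column is kept as categorical.
specified = resolve_categorical_cols(rows, ["color", "zip_code"])
```

Here `inferred` contains only `color`, while `specified` also contains `zip_code`, which is the case the issue asks to support.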

hjwilli avatar Jul 09 '20 20:07 hjwilli

To discuss -

  • Differing nomenclature - In the metafeature class (dataset_describe.py), columns can be categorical, and categorical columns can be encoded as nominal or ordinal. This differs slightly from the way these terms are used in other parts of the application, where columns referred to as "categorical" are equivalent to what dataset_describe refers to as "nominal".

  • When calculating metafeatures, dataset_describe.py currently allows only one encoder, which applies to all categorical columns in a dataset. It does not currently allow both nominal and ordinal data in the same dataset.
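
For discussion, one possible shape for mixing nominal and ordinal columns in the same dataset is sketched below. This is only an assumption about how per-column encoding could look, not how dataset_describe.py works today; `encode_nominal`, `encode_ordinal`, `encode_columns`, and the `nominal_cols` / `ordinal_cols` arguments are all hypothetical names.

```python
def encode_nominal(values):
    """One-hot encode: one indicator per distinct category, in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def encode_ordinal(values, order):
    """Map each value to its rank in the user-supplied category order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def encode_columns(columns, nominal_cols, ordinal_cols):
    """Choose an encoder per column instead of one encoder for the whole dataset.

    columns:      dict of column name -> list of values
    nominal_cols: set of column names to one-hot encode
    ordinal_cols: dict of column name -> ordered list of categories
    """
    encoded = {}
    for name, values in columns.items():
        if name in ordinal_cols:
            encoded[name] = encode_ordinal(values, ordinal_cols[name])
        elif name in nominal_cols:
            encoded[name] = encode_nominal(values)
        else:
            encoded[name] = list(values)  # non-categorical columns pass through
    return encoded


columns = {
    "color": ["red", "blue", "red"],
    "size": ["small", "large", "medium"],
    "height": [1.7, 1.8, 1.6],
}
encoded = encode_columns(
    columns,
    nominal_cols={"color"},
    ordinal_cols={"size": ["small", "medium", "large"]},
)
```

In this sketch `color` is one-hot encoded, `size` is mapped to ranks using the given order, and `height` is left unchanged.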

hjwilli avatar Jul 09 '20 21:07 hjwilli