Leveraging the "category" type from pandas for Categorical-Ordinal
Problem Description
Hello. If I understood the API correctly, dimensions are either "real", "discrete" or "categorical", and this is inferred automatically from the dtype of each column. For example, if dtype == "O", the column is inferred to be "categorical". As a result, all categorical columns are assumed to be non-ordinal (without ordering) and are one-hot encoded for the models.
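To make the behavior described above concrete, here is a minimal sketch in plain pandas. The helper `infer_kind` is hypothetical and only illustrates dtype-based inference as I understand it; it is not the library's actual code, and the column names are made up.

```python
import pandas as pd


def infer_kind(series: pd.Series) -> str:
    """Hypothetical helper: map a column's dtype to a dimension kind."""
    if pd.api.types.is_float_dtype(series):
        return "real"
    if pd.api.types.is_integer_dtype(series):
        return "discrete"
    # Everything else, including object ("O") columns, is treated as categorical.
    return "categorical"


df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "count": [1, 2, 3],
    "size": ["small", "large", "medium"],
})

kinds = {name: infer_kind(col) for name, col in df.items()}

# One-hot encoding gives one indicator column per value and keeps no
# notion of small < medium < large.
one_hot = pd.get_dummies(df["size"])
```

Note that the indicator columns come out in sorted order (large, medium, small), which is unrelated to the natural ordering of the sizes.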
Expected behavior
I think it would take little effort to also support ordinal categoricals (encoded as integer labels, for example). The ordering could be inferred from the pandas "category" dtype and its ordered=True/False parameter.
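As a rough illustration of what I mean, pandas already carries the ordering on the dtype, and the integer labels fall out of `.cat.codes`. The example values are made up.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the categories in their natural order and mark them as ordered.
ordered_dtype = pd.CategoricalDtype(
    categories=["small", "medium", "large"],
    ordered=True,
)
ordered_sizes = sizes.astype(ordered_dtype)

# The ordering flag is available on the column itself.
is_ordered = ordered_sizes.cat.ordered

# Integer labels follow the declared category order, not alphabetical order.
codes = ordered_sizes.cat.codes
```

Here `codes` holds 0, 2, 1, 0, so a model sees a single integer column that respects small < medium < large instead of three unrelated indicator columns.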
If you think this would be useful, you can point me to the code to edit and I can submit the contribution. In general, adding this simple ordering information can help models a lot.
Hi @Deathn0t, thanks for starting this discussion! I think there are 2 separate points you are bringing up:
- Right now, the SDV does not accept a dataframe if it includes any columns of dtype "category"; we only accept dtype "object". This should be fixed.
- When a dtype is "category", we can be smarter about capturing any ordering info attached to the dtype.
The team is actively thinking about metadata specification and input handling. There are many moving pieces. Why don't I reach back out to you when we have more clarity? That way, you can make changes on a more stable feature set.
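In the meantime, the two points in the list above can be sketched on the user side with plain pandas. The helper `split_category_info` is hypothetical and not part of the SDV; it casts "category" columns to "object" so the dataframe is accepted, and keeps any ordering in a separate dict so it is not lost.

```python
import pandas as pd


def split_category_info(df: pd.DataFrame):
    """Hypothetical helper: return (plain_df, ordering).

    plain_df has every "category" column cast to "object".
    ordering maps each ordered column name to its category order.
    """
    plain = df.copy()
    ordering = {}
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            if col.cat.ordered:
                ordering[name] = list(col.cat.categories)
            plain[name] = col.astype(object)
    return plain, ordering


df = pd.DataFrame({
    "grade": pd.Categorical(
        ["low", "high", "mid"],
        categories=["low", "mid", "high"],
        ordered=True,
    ),
    "color": pd.Categorical(["red", "blue", "red"]),
    "value": [1.0, 2.0, 3.0],
})

plain_df, ordering = split_category_info(df)
```

Only "grade" ends up in `ordering`, because "color" is a category column without an order.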
Quick update here: We are moving towards a better integration of ordinal categorical variables.
The newest version of RDT now includes an ordering parameter so that we can order the categories before transforming them. See CustomLabelEncoder.
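For readers who want the idea without the library, here is a minimal sketch of ordering categories before label encoding. The class `OrderedLabelSketch` is hypothetical and written in plain Python; it does not reproduce RDT's actual interface, so check the RDT documentation for the real parameter names.

```python
class OrderedLabelSketch:
    """Hypothetical, simplified stand-in for an order-aware label encoder."""

    def __init__(self, order):
        # The position in `order` becomes the integer label.
        self.order = list(order)
        self._to_code = {value: code for code, value in enumerate(self.order)}

    def transform(self, values):
        """Map each category to its position in the declared order."""
        return [self._to_code[value] for value in values]

    def reverse_transform(self, codes):
        """Map integer labels back to the original categories."""
        return [self.order[code] for code in codes]


encoder = OrderedLabelSketch(order=["low", "mid", "high"])
encoded = encoder.transform(["mid", "low", "high", "mid"])
decoded = encoder.reverse_transform(encoded)
```

Because the mapping is fixed by the declared order, the transform is reversible and the labels stay stable regardless of the order in which values appear in the data.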
This functionality has not yet been integrated with the SDV models. Once it is, we will be able to provide better support for categories.