datatable Support dummy encoding similar to Pandas get_dummies()

Currently split_into_nhot() supports only a single column, and the names of the columns it returns are just the values it found, without the original column name. Users may want something closer to what get_dummies() does in Pandas.

I wrote this simple function that does the trick. It may not be that efficient, but it could serve as an idea for how to implement the solution:

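import datatable as dt

def ohe_columns(columns, df):
    # Work on a copy so the caller's frame is left untouched
    df_work = df.copy()
    for column in columns:
        # One-hot encode a single column; the new names are its distinct values
        df_ohe = dt.str.split_into_nhot(df_work[column])
        # Prefix each new column with the source column name
        df_ohe.names = [f'{column}_{col}' for col in df_ohe.names]
        # Frame.cbind() appends the new columns in place
        df_work.cbind(df_ohe)
    return df_work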

Example:

df = dt.Frame([["cat","dog","rat","cat","dog"],["brown","black","black","brown","black"]],names=["animal","color"])

ohe_columns(["animal","color"],df)

Result:

animal	color	animal_cat	animal_dog	animal_rat	color_brown	color_black
▪▪▪▪	▪▪▪▪	▪	▪	▪	▪	▪
cat	brown	1	0	0	1	0
dog	black	0	1	0	0	1
rat	black	0	0	1	0	1
cat	brown	1	0	0	1	0
dog	black	0	1	0	0	1
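
For readers without datatable at hand, the naming and encoding shown in the result above can be reproduced in plain Python. This is only an illustrative sketch under a few assumptions: one_hot_reference is a hypothetical helper, it treats every cell as a single value (no separator splitting), and it orders the new columns by first appearance, which happens to match the table above but is not something split_into_nhot() is known to guarantee.

```python
def one_hot_reference(columns, data):
    """Plain-Python reference for the '<column>_<value>' layout.

    `data` maps column names to equally long lists of strings.
    Hypothetical helper: it only illustrates the expected output
    and does not use datatable.
    """
    encoded = {}
    for column in columns:
        values = data[column]
        # dict.fromkeys drops duplicates and keeps first-appearance order
        for value in dict.fromkeys(values):
            encoded[f"{column}_{value}"] = [int(v == value) for v in values]
    return encoded


data = {
    "animal": ["cat", "dog", "rat", "cat", "dog"],
    "color": ["brown", "black", "black", "brown", "black"],
}
result = one_hot_reference(["animal", "color"], data)
for name, flags in result.items():
    print(name, flags)
```

Each key of the returned dict corresponds to one of the new columns in the table, and each list holds that column's 0/1 flags row by row.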

I can help in the Python development of this function but not in the C++ part.

Jul 28 '21 03:07 FavioVazquez

@FavioVazquez, you could also iterate over the frame directly, which yields its columns one at a time, instead of indexing the parent frame for each column.

Jul 28 '21 14:07 samukweku

@FavioVazquez The Python function you wrote is already fine from a performance point of view. split_into_nhot() is fully parallel, and the rest of the code takes a negligible amount of time.

Yes, split_into_nhot() could be improved to support multi-column frames, but that implementation would not be any faster than what you did, because in C++ it would do roughly the same loop over the columns and the same final cbind().

So basically we're talking about a convenience function here, not about efficiency or performance.

Jul 29 '21 06:07 oleksiyskononenko

I could imagine split_into_nhot() taking an optional parameter prefix=, which could be either True (to prefix with the current column's name) or an explicit string. In that case, splitting a multi-column frame could be a simple one-liner like

dt.cbind([dt.str.split_into_nhot(col, prefix=True) for col in DT])

Jul 29 '21 19:07 st-pasha

Yes, keeping the old column name as a prefix sounds reasonable. As for the loop over the columns, I see no reason to move it to C++.

Jul 29 '21 20:07 oleksiyskononenko
