Support dummy encoding similar to Pandas get_dummies()

Open FavioVazquez opened this issue 4 years ago • 4 comments

The current split_into_nhot() method supports only a single column, and it does not keep the original column name in its output: the result columns are named only after the values found. Users may want something similar to what get_dummies() does in Pandas.

I wrote this simple function that does the trick. It is not very efficient, but it could serve as a starting point for an implementation:

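import datatable as dt

def ohe_columns(columns, df):
    # Work on a copy so the caller's frame is left untouched.
    df_work = df.copy()
    for column in columns:
        # One indicator column per distinct value found in `column`.
        df_ohe = dt.str.split_into_nhot(df_work[column])
        # Prefix each indicator with the source column name, as get_dummies() does.
        df_ohe.names = [f'{column}_{col}' for col in df_ohe.names]
        # Frame.cbind() appends the new columns in place.
        df_work.cbind(df_ohe)
    return df_work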

Example:

df = dt.Frame([["cat","dog","rat","cat","dog"],["brown","black","black","brown","black"]],names=["animal","color"])

ohe_columns(["animal","color"],df)

Result:

animal  color  animal_cat  animal_dog  animal_rat  color_brown  color_black
cat     brown           1           0           0            1            0
dog     black           0           1           0            0            1
rat     black           0           0           1            0            1
cat     brown           1           0           0            1            0
dog     black           0           1           0            0            1

I can help with the Python side of this function, but not with the C++ part.

FavioVazquez, Jul 28 '21 03:07
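For comparison, here is a rough sketch of the Pandas behaviour the request refers to, using the same example data. One difference worth noting: with `columns=`, `get_dummies()` replaces the encoded columns, so the `pd.concat` step is only an assumption about how one would keep the originals next to the indicators, as in the result above. The dtype of the indicator columns depends on the Pandas version, so no output is shown.

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "rat", "cat", "dog"],
    "color": ["brown", "black", "black", "brown", "black"],
})

# get_dummies() names each indicator "<column>_<value>" and drops
# the source columns listed in `columns=`.
dummies = pd.get_dummies(df, columns=["animal", "color"])

# To mirror the result shown in this issue (originals kept alongside the
# indicators), the original frame can be concatenated back on.
combined = pd.concat([df, dummies], axis=1)
```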

@FavioVazquez, you could also iterate over the frame's columns directly, without having to index the parent frame for each column.

samukweku, Jul 28 '21 14:07

@FavioVazquez The Python function that you created is perfectly fine from a performance point of view. split_into_nhot() is already fully parallel, and the rest of the code won't take a significant amount of time at all.

Yes, split_into_nhot() could be improved to support multi-column frames, but that implementation wouldn't be any faster than what you did, because in C++ it would do roughly the same loop over the columns and the same final cbind().

So basically we're talking about a convenience function here, not about efficiency or performance.

oleksiyskononenko, Jul 29 '21 06:07

I could imagine the function split_into_nhot() taking an optional parameter prefix=, which could be either True (to prefix with the current column's name) or an explicit string. In that case, splitting a multi-column frame could be a simple one-liner like:

dt.cbind([dt.str.split_into_nhot(col, prefix=True) for col in DT])

st-pasha, Jul 29 '21 19:07

Yes, keeping the old column name as a prefix sounds reasonable. As for the loop over the columns, I see no reason to move it to C++.

oleksiyskononenko, Jul 29 '21 20:07