Support dummy encoding similar to Pandas get_dummies()
split_into_nhot() currently accepts only a single column, and the names of the columns it returns contain only the possible values, not the name of the original column. Users may want to do something similar to what get_dummies() does in Pandas.
I wrote this simple function that does the trick. It's not that efficient, but it could serve as an idea for how to implement the solution:
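import datatable as dt

def ohe_columns(columns, df):
    # Work on a copy so that cbind() does not modify the caller's frame
    df_work = df.copy()
    for column in columns:
        # One 0/1 column per distinct value found in this column
        df_ohe = dt.str.split_into_nhot(df_work[column])
        # Prefix each new column with the name of its source column
        df_ohe.names = [f'{column}_{col}' for col in df_ohe.names]
        df_work.cbind(df_ohe)
    return df_work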
Example:
df = dt.Frame([["cat", "dog", "rat", "cat", "dog"],
               ["brown", "black", "black", "brown", "black"]],
              names=["animal", "color"])
ohe_columns(["animal", "color"], df)
Result:
| animal | color | animal_cat | animal_dog | animal_rat | color_brown | color_black |
|---|---|---|---|---|---|---|
| cat | brown | 1 | 0 | 0 | 1 | 0 |
| dog | black | 0 | 1 | 0 | 0 | 1 |
| rat | black | 0 | 0 | 1 | 0 | 1 |
| cat | brown | 1 | 0 | 0 | 1 | 0 |
| dog | black | 0 | 1 | 0 | 0 | 1 |
I can help with the Python development of this function, but not with the C++ part.
@FavioVazquez, you could also iterate over the frame directly, without having to index into the parent frame for each column.
@FavioVazquez The Python function that you created is perfect from a performance point of view. split_into_nhot() is already fully parallel, and the rest of the code won't take any significant amount of time.
Yes, split_into_nhot() could be improved to support multi-column frames, but that implementation wouldn't be any faster than what you did, because in C++ it would do roughly the same loop over the columns and the same final cbind().
So basically we're talking about a convenience function here, not about efficiency or performance.
I could imagine split_into_nhot() taking an optional parameter prefix=, which could be either True (to prefix with the current column's name) or an explicit string.
In that case, splitting a multi-column frame could be a simple one-liner like
dt.cbind([dt.str.split_into_nhot(col, prefix=True) for col in DT])
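To make the proposed naming rule concrete, here is a minimal pure-Python sketch of how such a prefix= option might resolve output column names. It does not use datatable at all. The helpers nhot_names() and one_hot() are hypothetical names invented for illustration, the "_" separator is borrowed from the ohe_columns() function earlier in this thread, and, unlike split_into_nhot(), the sketch assumes every cell holds a single value rather than a separated list.

```python
def nhot_names(column_name, values, prefix=None):
    """Build output column names for one encoded column (hypothetical helper).

    prefix=None or False -> bare values, with no prefix
    prefix=True          -> prefix with the source column's name
    prefix="some_string" -> prefix with that explicit string
    """
    if prefix is True:
        prefix = column_name
    if not prefix:
        return list(values)
    # Assumption: "_" as separator, as in ohe_columns() above
    return [f"{prefix}_{value}" for value in values]


def one_hot(column_name, data, prefix=None):
    """One-hot encode a list of single-valued cells (hypothetical helper)."""
    # Distinct values, in order of first appearance
    values = list(dict.fromkeys(data))
    names = nhot_names(column_name, values, prefix)
    # One 0/1 list per distinct value
    return {
        name: [int(item == value) for item in data]
        for name, value in zip(names, values)
    }


animal = ["cat", "dog", "rat", "cat", "dog"]
encoded = one_hot("animal", animal, prefix=True)
print(list(encoded))
```

With prefix=True the keys carry the source column's name, which is what lets the encoded columns of several source columns sit side by side in one frame without ambiguity.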
Yes, keeping the old column name as a prefix sounds reasonable. As for the loop over the columns, I see no reason to move it to C++.