finnts

add embeddings for all data models

Open mitokic opened this issue 4 years ago • 1 comments

Embeddings capture relationships between categorical values such as the time series ID and other groupings. They are helpful in deep learning models; not sure whether they are helpful in standard multivariate ML models.

Create embeddings from deep learning models. Then use those values for categorical variables instead of using one-hot encoding/dummy variables for categorical data. Woohoo!

Could also create a separate recipe to do this, or do some initial testing on the dataset to see whether we should switch over to it as the default. Maybe add a global option to use either dummy variables or embeddings for categorical data.
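To make the dummy-versus-embedding option concrete, here is a rough Python sketch of such a switch. Everything in it is hypothetical and not part of finnts or tidymodels: the function name `encode_categorical`, the `method` argument, and the `embedding_table` lookup (which stands in for vectors already learned by some deep learning model). The dummy branch keeps every level rather than dropping a reference level.

```python
from typing import Dict, List, Optional, Sequence


def encode_categorical(
    values: Sequence[str],
    method: str = "dummy",
    embedding_table: Optional[Dict[str, Sequence[float]]] = None,
) -> List[List[float]]:
    """Encode a categorical column as dummy variables or embedding vectors.

    Hypothetical helper for illustration only. `embedding_table` is assumed
    to map each category to a vector learned elsewhere.
    """
    if method == "dummy":
        # One column per distinct level, in sorted order.
        levels = sorted(set(values))
        return [[1.0 if v == level else 0.0 for level in levels] for v in values]
    if method == "embedding":
        if embedding_table is None:
            raise ValueError("method='embedding' requires an embedding_table")
        # Replace each category with its learned vector.
        return [list(embedding_table[v]) for v in values]
    raise ValueError(f"unknown method: {method!r}")


ids = ["a", "b", "a"]
dummy_rows = encode_categorical(ids)
embedded_rows = encode_categorical(
    ids, method="embedding", embedding_table={"a": [0.1, -0.3], "b": [0.7, 0.2]}
)
```

With dummy variables the number of columns grows with the number of levels; with embeddings it is fixed by the embedding size, which is the main appeal when there are many time series IDs.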

[Image: excerpt from the fast.ai Deep Learning for Coders book.]

This already looks like an easy integration into a recipe using step_embed: https://embed.tidymodels.org/reference/step_embed.html

Lastly, we need to determine the size of our embedding. There is no hard-and-fast rule for this, but a good heuristic given by Jeremy Howard of fast.ai is to take half the number of unique values, then add one.
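A minimal sketch of that heuristic in Python. The function name `embedding_size` is made up for illustration, and rounding the half down with integer division is an assumption, since the rule as quoted does not say how to handle odd counts.

```python
def embedding_size(n_unique: int) -> int:
    """Heuristic embedding width: half the number of unique values, plus one.

    Assumption: the half is rounded down for odd counts.
    """
    if n_unique < 1:
        raise ValueError("n_unique must be at least 1")
    return n_unique // 2 + 1


# A categorical variable with 10 distinct levels gets a 6-dimensional embedding.
size_for_ten_levels = embedding_size(10)
```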

mitokic · Jul 08 '21 21:07

https://insidebigdata.com/2021/03/07/video-highlights-deep-learning-for-probabilistic-time-series-forecasting/

link to video.

The video also makes a good point about not using a dummy variable for a holiday, but instead using a countdown of periods until the holiday and a count of periods after it. Could do the same for other categorical regressors.
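A rough Python sketch of the countdown idea, assuming periods are numbered 0, 1, 2, ... and holidays are given as period indices. The function name `holiday_countdown` and the choice of `None` when there is no upcoming or previous holiday are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Sequence, Tuple


def holiday_countdown(
    n_periods: int, holiday_periods: Sequence[int]
) -> List[Tuple[Optional[int], Optional[int]]]:
    """For each period, return (periods until next holiday, periods since last holiday).

    A value is None when no holiday lies in that direction.
    On a holiday itself both values are 0.
    """
    holidays = sorted(set(holiday_periods))
    features = []
    for t in range(n_periods):
        # Index of the first holiday at or after period t.
        nxt = bisect_left(holidays, t)
        until = holidays[nxt] - t if nxt < len(holidays) else None
        # Index of the last holiday at or before period t.
        prev = bisect_right(holidays, t) - 1
        since = t - holidays[prev] if prev >= 0 else None
        features.append((until, since))
    return features


# Six periods with holidays in periods 2 and 5.
features = holiday_countdown(6, [2, 5])
```

Unlike a single 0/1 dummy, these two numeric columns let a model pick up ramp-up and ramp-down effects around the holiday.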

mitokic · Jul 08 '21 21:07