Use irlba's truncated SVD to speed up step_pca
step_pca is very useful, but it is slow and memory-intensive when run on more than a few hundred features, even when num_comp is much smaller than p. (In my experience this makes tuning the num_comp parameter especially time-consuming, since tuning requires running the SVD preparation step many times.)
As a solution, this step could use the irlba package for truncated SVD, which is much faster and more memory-efficient when the number of components is small compared to p.
I could imagine the step either using irlba automatically when num_comp is far smaller than p, or doing so only when the user requests it via something like truncated = TRUE; either way it would be very helpful!
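For anyone wondering why a truncated SVD is sufficient here: the first k principal-component scores of a centered matrix X are just U_k %*% diag(d_k), so only the leading k singular triplets are needed. A minimal base-R sketch of the equivalence (base svd() is used only to keep the demo dependency-free; irlba::irlba() computes the same leading triplets without forming the full decomposition):

```r
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)   # 100 observations, 20 features
Xc <- scale(X, center = TRUE, scale = FALSE)
k <- 5

s <- svd(Xc, nu = k, nv = k)               # keep only the leading k vectors
scores_svd <- s$u %*% diag(s$d[1:k])       # truncated-SVD scores

scores_pca <- prcomp(X, center = TRUE)$x[, 1:k]

# identical up to column signs
max(abs(abs(scores_svd) - abs(scores_pca)))
```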
Reproducible example, if we were trying to build a model to identify which Jane Austen book a line of text came from:
library(janeaustenr)
library(dplyr)        # for filter() and select()
library(recipes)
library(textrecipes)
library(irlba)        # truncated SVD

# Train a model to match a single line to one of Jane Austen's books
books <- austen_books() %>%
  filter(text != "")

rec <- recipe(book ~ text, books) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 300) %>%
  step_tfidf(text)

# This is slow (~40s for me), and uses so much memory that it's hard to terminate
rec %>%
  step_pca(starts_with("tfidf"), num_comp = 5) %>%
  prep() %>%
  juice()

# But this is fast (~3.5s)
rec %>%
  prep() %>%
  juice() %>%
  select(-book) %>%
  as.matrix() %>%
  irlba(nv = 5)
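Note that irlba() returns the raw singular triplets rather than PCA scores. If irlba is installed, its prcomp_irlba() is a near drop-in replacement for prcomp() that centers the data and computes only the requested components; a self-contained sketch on a synthetic matrix standing in for the tf-idf matrix above:

```r
if (requireNamespace("irlba", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(500 * 300), nrow = 500)     # stand-in for the tf-idf matrix
  pca5 <- irlba::prcomp_irlba(X, n = 5, center = TRUE)
  dim(pca5$x)   # the 5 principal-component scores, as step_pca would produce
}
```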
Related to #73
We are definitely interested in functionality like this! This is mostly implemented already so we'll get a draft PR ready and would love some feedback on it and/or more contributions. We are fairly sure we want to include this in embed, along with a Bayesian implementation of sparse PCA.
In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests a large fraction of the components (say num_comp > p / 4). It would also be nice to have a randomized SVD implementation (perhaps via the rsvd package) for larger datasets, perhaps as step_pca_approximate().
Also cc @topepo. https://github.com/DataSlingers/MoMA is a high-quality sparse PCA implementation by Michael Weylandt (also author of a high-quality glmnet reimplementation).
Did this issue get resolved in https://github.com/tidymodels/embed/pull/83, or should it be kept open for more step variants?
I don't think this is resolved, since step_pca still uses full PCA by default, and the above reprex (getting 5 principal components from a dataset with 62k observations) is still slow. I agree with Alex that it can be made much faster in the common use case by making truncated SVD the default:
In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests something like num_comp > p / 4 or something like that
But maybe this issue belongs in the recipes package, since that's where step_pca lives?
I'd add an alternate PCA step here. Those package dependencies are a pain, and I'd keep them here.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.