Use irlba's truncated SVD to speed up step_pca
step_pca is very useful, but it is slow and memory-intensive when run on more than a few hundred features, even when num_comp is much smaller than p. (In my experience this makes tuning the num_comp parameter especially time-consuming, since tuning requires running the SVD preparation step many times.)
As a solution, this step could use the irlba package for truncated SVD, which is much faster and more memory-efficient when the number of components is small compared to p.
I could imagine the step either using irlba automatically when num_comp is far smaller than p, or doing so only when the user requests it via something like truncated = TRUE; either way it would be very helpful!
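For anyone wondering why a truncated SVD is sufficient here: the first k principal-component scores of a centered matrix X are just U_k %*% diag(d_k), so only the leading k singular triplets are needed. A minimal base-R sketch of the equivalence (base svd() is used only to keep the demo dependency-free; irlba::irlba() computes the same leading triplets without forming the full decomposition):

```r
set.seed(1)
X <- matrix(rnorm(100 * 20), nrow = 100)   # 100 observations, 20 features
Xc <- scale(X, center = TRUE, scale = FALSE)
k <- 5

s <- svd(Xc, nu = k, nv = k)               # keep only the leading k vectors
scores_svd <- s$u %*% diag(s$d[1:k])       # truncated-SVD scores

scores_pca <- prcomp(X, center = TRUE)$x[, 1:k]

# identical up to column signs
max(abs(abs(scores_svd) - abs(scores_pca)))
```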
Reproducible example, if we were trying to build a model to identify which Jane Austen book a line of text came from:
library(janeaustenr)
library(dplyr)        # for filter() and select()
library(recipes)
library(textrecipes)
library(irlba)        # truncated SVD

# Train a model to match a single line to one of Jane Austen's books
books <- austen_books() %>%
  filter(text != "")

rec <- recipe(book ~ text, books) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 300) %>%
  step_tfidf(text)

# This is slow (~40s for me), and uses so much memory that it's hard to terminate
rec %>%
  step_pca(starts_with("tfidf"), num_comp = 5) %>%
  prep() %>%
  juice()

# But this is fast (~3.5s)
rec %>%
  prep() %>%
  juice() %>%
  select(-book) %>%
  as.matrix() %>%
  irlba(nv = 5)
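Note that irlba() returns the raw singular triplets rather than PCA scores. If irlba is installed, its prcomp_irlba() is a near drop-in replacement for prcomp() that centers the data and computes only the requested components; a self-contained sketch on a synthetic matrix standing in for the tf-idf matrix above:

```r
if (requireNamespace("irlba", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(rnorm(500 * 300), nrow = 500)     # stand-in for the tf-idf matrix
  pca5 <- irlba::prcomp_irlba(X, n = 5, center = TRUE)
  dim(pca5$x)   # the 5 principal-component scores, as step_pca would produce
}
```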
Related to #73
We are definitely interested in functionality like this! This is mostly implemented already so we'll get a draft PR ready and would love some feedback on it and/or more contributions. We are fairly sure we want to include this in embed, along with a Bayesian implementation of sparse PCA.
In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests a large fraction of the components (say num_comp > p / 4). It would also be nice to have a randomized SVD implementation (perhaps via the rsvd package) for larger datasets, perhaps as step_pca_approximate().
Also cc @topepo. https://github.com/DataSlingers/MoMA is a high-quality sparse PCA implementation by Michael Weylandt (also author of a high-quality glmnet reimplementation).
Did this issue get resolved in https://github.com/tidymodels/embed/pull/83, or should it be kept open for more step variants?
I don't think this is resolved, since step_pca still uses full PCA by default, and the above reprex (getting 5 principal components from a dataset with 62k observations) is still slow. I agree with Alex that it can be made much faster in the common use case by making truncated SVD the default:
In general I'm a firm believer that PCA should default to a truncated SVD implementation (either irlba or RSpectra) and only switch to a full SVD when the user requests something like num_comp > p / 4 or something like that
But maybe this issue belongs in the recipes package, since that's where step_pca lives?
I'd add an alternate PCA step here. Those package dependencies are a pain, and I'd keep them here.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.