Restricting dtm/tf_idf creation to only the top N features from the lexicon

Open pazzo83 opened this issue 7 years ago • 1 comment

I am working with a corpus of 100k+ documents, so the number of features in the lexicon is extremely high, and I'm running into memory issues as a result. I know that with scikit-learn's TfidfVectorizer and similar approaches, you can limit the number of features so that you only deal with the top N features when working with the dtm and tf_idf matrices. Is there some way to do that with this package, or are there plans to add such a feature?

Thanks!

Feb 17 '18 17:02 pazzo83

Not yet, but might be worth adding.

Feb 18 '18 20:02 aviks
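
For anyone landing here in the meantime, the idea being asked about can be sketched independently of any package. The snippet below is only an illustration in plain Python, not this package's API: the helper names `top_n_lexicon` and `restricted_dtm` are hypothetical, and "top N" is assumed to mean the N terms with the highest total count across the corpus, which is how scikit-learn's `max_features` option ranks terms.

```python
from collections import Counter


def top_n_lexicon(docs, n):
    """Return the n most frequent terms across all tokenized documents.

    Hypothetical helper: ranks terms by total corpus count.
    """
    counts = Counter(token for doc in docs for token in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def restricted_dtm(docs, vocab):
    """Build a document-term count matrix using only the terms in vocab.

    Hypothetical helper: tokens outside vocab are ignored.
    """
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc:
            j = index.get(token)
            if j is not None:
                row[j] += 1
        rows.append(row)
    return rows


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab = top_n_lexicon(docs, 3)
dtm = restricted_dtm(docs, vocab)
print(vocab)
print(dtm)
```

The matrix here is a dense list of lists for readability. For a corpus of 100k+ documents a sparse representation would still be needed, but capping the vocabulary before building the matrix bounds the number of columns at N.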