TextAnalysis.jl
Restricting dtm/tf_idf creation to only the top N features from the lexicon
I am working with a corpus of 100k+ documents, so the lexicon contains an extremely large number of features and I'm running into memory issues. In scikit-learn's TfidfVectorizer and similar tools, you can cap the number of features (the max_features parameter) so that the dtm and tf_idf matrices only cover the top N features. Is there some way to do that with this package, or are there plans to add such a feature?
Thanks!
Not yet, but might be worth adding.
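In the meantime, the idea can be sketched outside the package. The following is a minimal, library-independent illustration in plain Python, not TextAnalysis.jl's API: it assumes "top N" means the N terms with the highest total count across the corpus, and the helper names top_n_terms and restricted_dtm are hypothetical. It counts terms once, keeps the N most frequent, and then builds a sparse document-term matrix over those columns only, so memory grows with N rather than with the full lexicon size.

```python
from collections import Counter


def top_n_terms(tokenized_docs, n):
    """Return the n most frequent terms in the corpus, sorted alphabetically.

    Assumption: "top" means highest total count across all documents.
    """
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(doc)
    # Break frequency ties alphabetically so the selection is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(term for term, _ in ranked[:n])


def restricted_dtm(tokenized_docs, terms):
    """Build a sparse DTM: one {column_index: count} dict per document.

    Tokens that are not in `terms` are skipped, so no column is ever
    allocated for them.
    """
    column = {term: j for j, term in enumerate(terms)}
    rows = []
    for doc in tokenized_docs:
        row = {}
        for token in doc:
            j = column.get(token)
            if j is not None:
                row[j] = row.get(j, 0) + 1
        rows.append(row)
    return rows


# Toy corpus, already tokenized (hypothetical data for illustration only).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
terms = top_n_terms(docs, 3)
dtm = restricted_dtm(docs, terms)
```

The same two-pass approach (select the terms first, then build the matrix over only those terms) should carry over to any package that can construct a dtm from an explicit term list; whether TextAnalysis.jl exposes such a constructor is something to check against its documentation.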