Dimensionality reduction

Open • ebaggott opened this issue 7 years ago • 1 comment

This is great! How can one incorporate dimensionality reduction into the pipeline? For substantive and speed reasons, I'd like to exclude the most and least common words:

corpus = st.CorpusFromPandas(
    df,
    category_col='country',
    text_col='text',
    nlp=nlp,
    # can we discard 1st and 99th percentile of words here?
).build()

ebaggott avatar Jun 01 '18 16:06 ebaggott

Thanks!

Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:

import numpy as np

# Remove bigrams from corpus.
corpus = corpus.get_unigram_corpus() 

# Create a pandas Series indexed on words containing their frequencies
term_frequencies = corpus.get_term_freq_df().sum(axis=1)

# Get the terms in the 99th and 1st percentiles
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index

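# Remove them from the corpus. Index.union is used because recent pandas
# versions no longer treat `|` between two Index objects as a set union.
reduced_corpus = corpus.remove_terms(terms_99th_pctl.union(terms_1st_pctl))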
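To see what the percentile cutoffs select before touching a real corpus, here is a minimal sketch on a toy frequency Series. The data and the helper name `outlying_terms` are made up for illustration and are not part of Scattertext; the Series stands in for the result of `corpus.get_term_freq_df().sum(axis=1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical term frequencies, standing in for
# corpus.get_term_freq_df().sum(axis=1) on a real corpus.
term_frequencies = pd.Series(
    {'the': 500, 'country': 40, 'trade': 35, 'policy': 30,
     'tariff': 5, 'zeitgeist': 1, 'quux': 1}
)


def outlying_terms(freqs, low_pct=1, high_pct=99):
    """Return the terms at or below the low percentile cutoff
    or at or above the high percentile cutoff (illustrative helper)."""
    low_cut = np.percentile(freqs, low_pct)
    high_cut = np.percentile(freqs, high_pct)
    rare = freqs[freqs <= low_cut].index
    common = freqs[freqs >= high_cut].index
    # Set union of the two pandas Index objects.
    return rare.union(common)


terms_to_remove = outlying_terms(term_frequencies)
print(sorted(terms_to_remove))
```

The resulting index can be passed to `corpus.remove_terms(...)` as in the snippet above. Note that the `<=` comparison removes every term tied at the low cutoff, so on a corpus where many terms occur only once, the "1st percentile" filter may drop far more than 1% of the vocabulary; it is worth checking `len(terms_to_remove)` before applying it.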

Hope this helps!

JasonKessler avatar Jun 01 '18 17:06 JasonKessler