Scattertext-PyData
Dimensionality reduction
This is great! How can one incorporate dimensionality reduction into the pipeline? For substantive and speed reasons, I'd like to exclude the most and least common words:
corpus = st.CorpusFromPandas(
    df,
    category_col='country',
    text_col='text',
    nlp=nlp,
    # can we discard 1st and 99th percentile of words here?
).build()
Thanks!
Right now there's no way to exclude terms during corpus construction. However, after the corpus is constructed, you can easily remove outlying terms. For example:
import numpy as np
# Remove bigrams from corpus.
corpus = corpus.get_unigram_corpus()
# Create a pandas Series indexed on words containing their frequencies
term_frequencies = corpus.get_term_freq_df().sum(axis=1)
# Get the terms in the 99th and 1st percentiles
terms_99th_pctl = term_frequencies[term_frequencies >= np.percentile(term_frequencies, 99)].index
terms_1st_pctl = term_frequencies[term_frequencies <= np.percentile(term_frequencies, 1)].index
# Remove them from the corpus
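# Use Index.union; the | operator no longer performs a set union on a pandas Index in pandas 2.x
reduced_corpus = corpus.remove_terms(terms_99th_pctl.union(terms_1st_pctl))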
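If you want to check the percentile logic without building a corpus, here is a minimal sketch on a toy frequency Series. The term counts and the helper name `outlying_terms` are made up for illustration and are not part of Scattertext; the Series stands in for the result of `corpus.get_term_freq_df().sum(axis=1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical term counts standing in for
# corpus.get_term_freq_df().sum(axis=1)
term_frequencies = pd.Series(
    {"the": 500, "of": 300, "data": 40, "model": 35, "plot": 20,
     "term": 12, "corpus": 8, "axis": 5, "zipf": 1, "hapax": 1}
)


def outlying_terms(freqs, low_pct=1, high_pct=99):
    """Return the terms at or below low_pct and at or above high_pct."""
    # Compute both cutoffs in a single call
    low_cut, high_cut = np.percentile(freqs, [low_pct, high_pct])
    too_common = freqs[freqs >= high_cut].index
    too_rare = freqs[freqs <= low_cut].index
    # Set union of the two pandas Index objects
    return too_common.union(too_rare)


to_remove = outlying_terms(term_frequencies)
print(sorted(to_remove))
```

One thing to watch for: because the lower filter uses `<=`, every term tied at the minimum count is removed. In a real corpus many terms occur only once, so this can drop far more than 1% of the vocabulary. It may be worth checking `len(to_remove)` before passing the result to `corpus.remove_terms`.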
Hope this helps!