topic_modeling_tutorial
Tutorial fixes
In input cell 13 of Notebook 1 there is this function:
def split_words(self, text, stopwords=STOPWORDS):
    """
    Break text into a list of single words. Ignore any token that falls into
    the `stopwords` set.
    """
    return [word
            for word in gensim.utils.tokenize(text, lower=True)
            if word not in STOPWORDS and len(word) > 3]
Looking closely, the `stopwords` parameter is never used: the body filters against the global STOPWORDS instead, so passing a custom stopword set has no effect.
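A possible fix is to filter against the `stopwords` argument rather than the global. The sketch below is only a suggestion, not the notebook's actual code: it is written as a plain function (without `self`), and the `except ImportError` branch is a hypothetical stand-in tokenizer and stopword set so the example still runs where gensim is not installed.

```python
import re

try:
    from gensim.utils import tokenize
    from gensim.parsing.preprocessing import STOPWORDS
except ImportError:
    # Hypothetical stand-ins, only so this sketch runs without gensim.
    STOPWORDS = frozenset({"about", "that", "this", "with", "from"})
    _ALPHABETIC = re.compile(r"(((?![\d])\w)+)", re.UNICODE)

    def tokenize(text, lower=False):
        """Yield alphabetic tokens, optionally lowercased."""
        if lower:
            text = text.lower()
        for match in _ALPHABETIC.finditer(text):
            yield match.group()


def split_words(text, stopwords=STOPWORDS):
    """
    Break text into a list of single words. Ignore any token that falls into
    the `stopwords` set or is three characters or shorter.
    """
    return [word
            for word in tokenize(text, lower=True)
            if word not in stopwords and len(word) > 3]
```

With this change, a call such as `split_words("The quick brown foxes jumped", stopwords={"quick"})` drops "quick" because it is in the set that was passed in, and drops "the" because of the length check.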
Also, the corpus used in Notebook 2, simplewiki-20140623-pages-articles.xml.bz2, is no longer available because that dump is too old. This can be fixed by pointing to the latest dump, simplewiki-latest-pages-articles.xml.bz2, which can be found at https://dumps.wikimedia.org/simplewiki/latest/
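For anyone who wants to fetch the file from Python, here is a minimal sketch. The helper names `latest_dump_url` and `download_latest_dump` are made up for illustration and are not part of the tutorial; the URL is built from the directory and file name given above.

```python
import urllib.request

DUMPS_ROOT = "https://dumps.wikimedia.org"


def latest_dump_url(wiki="simplewiki"):
    """Build the URL of the latest pages-articles dump for a wiki."""
    filename = f"{wiki}-latest-pages-articles.xml.bz2"
    return f"{DUMPS_ROOT}/{wiki}/latest/{filename}"


def download_latest_dump(target_path, wiki="simplewiki"):
    """Download the latest dump to target_path (hypothetical helper)."""
    urllib.request.urlretrieve(latest_dump_url(wiki), target_path)
    return target_path
```

Note that the "latest" file changes over time, so results in Notebook 2 may differ somewhat from the ones shown with the 2014 dump.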