topic_modeling_tutorial

Tutorial fixes

Open jonasrla opened this issue 9 years ago • 0 comments

In input cell 13 of Notebook 1 there is this function:

def split_words(self, text, stopwords=STOPWORDS):
    """
    Break text into a list of single words. Ignore any token that falls into
    the `stopwords` set.

    """
    return [word
            for word in gensim.utils.tokenize(text, lower=True)
            if word not in STOPWORDS and len(word) > 3]

Looking closely, the stopwords parameter is never used: the list comprehension filters against the global STOPWORDS instead, so passing a custom stopword set has no effect.
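A possible fix, as a minimal sketch rather than the maintainers' patch: filter against the stopwords argument instead of the global. To keep the example runnable without gensim installed, it uses a small regex tokenizer and a tiny STOPWORDS set as stand-ins (both are assumptions); in the notebook the tokenizer would stay gensim.utils.tokenize(text, lower=True). The sketch also drops self, on the assumption that the function is used as a plain function.

```python
import re

# Stand-in for the notebook's global stopword set (assumption: the real
# STOPWORDS in the notebook is a much larger set).
STOPWORDS = frozenset({"this", "that", "with", "from"})


def tokenize(text):
    """Hypothetical stand-in for gensim.utils.tokenize(text, lower=True):
    yields lowercased runs of alphabetic characters."""
    return (match.group().lower() for match in re.finditer(r"[^\W\d_]+", text))


def split_words(text, stopwords=STOPWORDS):
    """
    Break text into a list of single words. Ignore any token that falls into
    the `stopwords` set.

    """
    return [word
            for word in tokenize(text)
            # The fix: use the `stopwords` argument, not the global STOPWORDS.
            if word not in stopwords and len(word) > 3]


sentence = "This tutorial covers topic modeling with gensim"
default_result = split_words(sentence)
custom_result = split_words(sentence, stopwords={"tutorial", "gensim"})
print(default_result)
print(custom_result)
```

With this change, the two calls return different lists, which shows that a custom stopword set is now respected.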

Also, the corpus used in Notebook 2, simplewiki-20140623-pages-articles.xml.bz2, is no longer available because the dump is too old. That can be fixed by referencing the latest dump instead, simplewiki-latest-pages-articles.xml.bz2, which can be found here: https://dumps.wikimedia.org/simplewiki/latest/
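As a rough sketch of how the notebook could point at the latest dump, the snippet below joins the directory URL and filename given above. The helper names dump_url and download_dump are hypothetical and not part of the tutorial, and the download is defined but not run here because the file is large.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Directory and filename taken from the issue text above.
DUMP_DIR_URL = "https://dumps.wikimedia.org/simplewiki/latest/"
DUMP_FILENAME = "simplewiki-latest-pages-articles.xml.bz2"


def dump_url(base=DUMP_DIR_URL, filename=DUMP_FILENAME):
    """Return the full URL of the latest Simple English Wikipedia dump."""
    return urljoin(base, filename)


def download_dump(target_path=DUMP_FILENAME):
    """Download the dump to target_path and return the local path.

    Hypothetical helper: call it explicitly when you want the file,
    since the download is large.
    """
    local_path, _headers = urlretrieve(dump_url(), target_path)
    return local_path


print(dump_url())
```

Because the "latest" file changes over time, results in Notebook 2 may differ slightly from the outputs recorded with the 2014 dump.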

jonasrla, Apr 28 '16 04:04