Should the stopword list be updated?
I wrote a Python script to compare the list of stopwords currently used by pattern's vector module against other popular stopword lists to check whether an update is required.
I compared pattern's list against 11 sources of stopwords (listed below). For each source, the words present in the source but not in pattern's stopword list, and vice versa, are reported in a corresponding file. My code stores the comparison output in the Results directory here. The file x.txt contains the result of comparing stopword list x against the stopword list used by pattern.
My code is available here: https://github.com/ni9elf/PatternClipsExperiments
The comparison of stopword lists is performed by the find_important and compare functions of the StemmerChecker class.
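The comparison itself boils down to two set differences per source. The following is a minimal sketch of that idea, not the code from the repository: the function names `load_stopwords` and `compare_stopwords` are hypothetical, and it assumes each list is a plain text file with one stopword per line.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line (assumed file layout), lowercased, skipping blanks."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def compare_stopwords(pattern_words, other_words):
    """Return two sorted lists: words only in the other list, words only in pattern's list."""
    pattern_set, other_set = set(pattern_words), set(other_words)
    only_in_other = sorted(other_set - pattern_set)
    only_in_pattern = sorted(pattern_set - other_set)
    return only_in_other, only_in_pattern


# Toy data for illustration only; these are not the real lists.
pattern_words = {"a", "an", "the", "of"}
other_words = {"a", "the", "is", "was"}

only_in_other, only_in_pattern = compare_stopwords(pattern_words, other_words)
print(only_in_other)    # ['is', 'was']
print(only_in_pattern)  # ['an', 'of']
```

With real data, the two sets would come from `load_stopwords` applied to pattern's list and to one of the source files, and the two result lists would be written to the matching x.txt report.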
Lists of stopwords compared against:
- unine.txt (Information retrieval multilingual resources from Universite de Neuchatel, Switzerland) link
- princeton.txt (Algorithms book by Robert Sedgewick and Kevin Wayne, Princeton) link
- nltk.txt (Natural Language Toolkit 3.2.5)
- yoast.txt (YoastSEO is a text analysis and assessment library in JavaScript for SEO feedback) link
- mysql.txt (MySQL Stopword list) link
- ranksnl_short.txt (Default stopword list used by ranks.nl) link
- ranksnl_long.txt (Longer version of stopword list used by ranks.nl) link
- corenlp.txt (Stanford CoreNLP - natural language software) link
- mallet.txt (MALLET (MAchine Learning for LanguagE Toolkit) from UMass Amherst) link
- glasgow.txt (Information retrieval resources from University of Glasgow) link
- onix.txt (Onix Text Retrieval Toolkit) link