
Is Porter stemmer working correctly?

Open ni9elf opened this issue 7 years ago • 0 comments

I wrote a Python script to check the output of pattern's implementation of the Porter2 stemmer (in the vector module) against the output of the original implementation by Martin Porter.

Martin Porter provides a test vocabulary of 29417 words together with the stems that his implementation produces for them. My script runs pattern's Porter stemmer over the same words and compares each result with the corresponding reference stem. It found 215 mismatches and stores them in the file errors.txt. Sample preview:

| word_input | original_output | pattern_output |
| --- | --- | --- |
| aimlessly | aimless | aimlessli |
| gazelle | gazell | gazel |
| narratives | narrat | narr |

Pattern implements the Porter stemmer in the vector module. To use it, first import it with `from pattern.vector import stem, PORTER` and then call `stem(input, stemmer=PORTER)`. My code is available here: https://github.com/ni9elf/PatternClipsExperiments
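For readers who want to reproduce the check without opening the repository, here is a minimal sketch of this kind of comparison. It is not the script from the repository. The helper names (`find_mismatches`, `read_lines`, `run_against_pattern`) are hypothetical, and the file names `voc.txt` and `output.txt` are assumptions about how the reference vocabulary and reference stems are stored (one entry per line, in the same order). Only `stem` and `PORTER` come from pattern, as described above.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    words: Iterable[str],
    expected: Iterable[str],
    stem_fn: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for every disagreement."""
    mismatches = []
    for word, want in zip(words, expected):
        got = stem_fn(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def read_lines(path: str) -> List[str]:
    """Read one entry per line, keeping line order so both files stay aligned."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle.read().splitlines()]


def run_against_pattern(
    vocab_path: str = "voc.txt",        # assumed name of the input vocabulary
    expected_path: str = "output.txt",  # assumed name of the reference stems
    report_path: str = "errors.txt",
) -> int:
    """Compare pattern's stemmer with the reference stems and write a report."""
    # Imported lazily so the helpers above work without pattern installed.
    from pattern.vector import stem, PORTER

    mismatches = find_mismatches(
        read_lines(vocab_path),
        read_lines(expected_path),
        lambda word: stem(word, stemmer=PORTER),
    )
    with open(report_path, "w", encoding="utf-8") as handle:
        handle.write("word_input original_output pattern_output\n")
        for row in mismatches:
            handle.write(" ".join(row) + "\n")
    return len(mismatches)
```

Calling `run_against_pattern()` in a directory that contains both files would write the report and return the number of mismatches. Because `find_mismatches` takes the stemmer as a plain callable, the same function can be pointed at any other stemmer for a cross-check.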

ni9elf, Apr 11 '18 16:04