[MODULE] - Spelling Check
Description
This module counts the misspelled words in a given string. It uses NLTK, a suite of libraries and programs for statistical NLP. From NLTK we import the words corpus, a word list with about a quarter of a million entries, and the Brown corpus. We take the union of the two corpora and look up every word of the text in it. Words without a match are assumed to be misspelled, and the module reports how many such words the text contains.
Implementation
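import nltk
from nltk.corpus import words, brown

# nltk.download() takes one resource id per call; a second positional
# argument would be treated as the download directory.
nltk.download('words')
nltk.download('brown')

# Lowercase the corpus words so they match the lowercased input, and keep
# them in a set so that each membership check is fast.
correct_words = {w.lower() for w in words.words()} | {w.lower() for w in brown.words()}

text = "About today, somthing was not rigt."
# Strip commas and periods, lowercase, and split on whitespace.
text_list = text.replace(',', '').replace('.', '').lower().split()

misspelled = []
for word in text_list:
    if word not in correct_words:
        misspelled.append(word)

print({
    "spellingErrors": len(misspelled)
})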
Input
"About today, somthing was not rigt."
Output
{"spellingErrors": 2}
Additional information
The module requires the words and brown corpora. On a local machine they might not be pre-installed, but they can be installed by running
nltk.download('words') and nltk.download('brown') before the corpora are loaded.
This uses a server under the hood, right? Might be difficult for setup in refinery, but let's give it a try :)
Yes let's try it out. If it doesn't work, we can look for alternatives.
Since the language-tool based approach caused some issues, how about using https://norvig.com/spell-correct.html?
@jhoetter THAT IS SO COOL, how did you find this?
Was part of some AI course I did a few years ago :)
I think @divyanshukatiyar is making some changes to that module, so I'll reopen it.
Alright, so I made some changes to the module. To keep it in the classifier, this module just tells you how many spelling errors there are in the text string.
On the live version, this module endpoint is not working. To be usable in a production environment we also need to think about ways to make this module a bit quicker. Putting this to draft, let's have a look at this together next week!
I have rewritten the module to use the TextBlob library, which implements the method proposed by Peter Norvig. The approach is not perfect, but I think it's alright for a fast and free service. See PR #247
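For anyone who has not read the Norvig article, here is a minimal sketch of the idea in plain Python. It is not TextBlob's code and not what the PR contains: the tiny WORD_COUNTS table is made up for illustration, and this version only looks at candidates one edit away, whereas the original article also considers words two edits away.

```python
from collections import Counter

# Toy word counts standing in for a real corpus (assumption, illustration only).
WORD_COUNTS = Counter({
    "about": 5, "today": 4, "something": 6,
    "was": 9, "not": 8, "right": 7, "rig": 1,
})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(candidates):
    """Keep only the candidates that appear in the word counts."""
    return {w for w in candidates if w in WORD_COUNTS}


def correction(word):
    """Pick the most frequent known candidate, or return the word unchanged."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


def count_spelling_errors(text):
    """Count tokens that are not in the word counts (same tokenizing as the module)."""
    tokens = text.replace(",", "").replace(".", "").lower().split()
    return sum(1 for token in tokens if token not in WORD_COUNTS)
```

With this toy table, correction("rigt") has two known candidates, "rig" and "right", and picks "right" because it has the higher count.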
After the corpus was loaded into a set, the module is much faster. Closing now.
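Roughly what the set change looks like, as a sketch rather than the exact code in the module. The two short lists are stand-ins for words.words() and brown.words() so the snippet runs without NLTK, and the helper names build_vocabulary and count_misspelled are made up for this example.

```python
def build_vocabulary(*word_lists):
    """Merge word lists into one lowercase set for fast membership checks."""
    return {word.lower() for word_list in word_lists for word in word_list}


def count_misspelled(text, vocabulary):
    """Count the tokens of text that are missing from the vocabulary."""
    tokens = text.replace(",", "").replace(".", "").lower().split()
    return len([token for token in tokens if token not in vocabulary])


# Stand-ins for words.words() and brown.words() (assumption, illustration only).
words_cor = ["about", "today", "not"]
brown_cor = ["The", "was", "Today"]

vocabulary = build_vocabulary(words_cor, brown_cor)
print({"spellingErrors": count_misspelled("About today, somthing was not rigt.", vocabulary)})
```

A lookup in a set takes roughly constant time on average, while a lookup in a list scans the entries one by one, which adds up quickly with corpora of this size.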
I found a small issue: in FastAPI it produces an Internal Server Error. Also, in common the following lines are missing: import nltk and nltk.download('words', 'brown')
refinery works fine
Yes that's right. The code "assumes" that both corpora are already downloaded. We should add a catch that checks if that's the case and downloads them if not. Good catch!
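A possible shape for that check, as a sketch under a few assumptions: the function name ensure_corpora and its find/download parameters are hypothetical and exist only so the logic can be tried without network access. By default it falls back to nltk.data.find, which raises LookupError when a resource is missing, and nltk.download.

```python
def ensure_corpora(names=("words", "brown"), find=None, download=None):
    """Download each missing corpus and return the names that were fetched.

    find and download are hypothetical injection points for testing; when
    they are left out, the NLTK functions are used.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name in names:
        try:
            # Raises LookupError if the corpus is not installed locally.
            find(f"corpora/{name}")
        except LookupError:
            # One resource id per call; a second positional argument to
            # nltk.download would be read as the download directory.
            download(name)
            fetched.append(name)
    return fetched
```

Calling ensure_corpora() once at startup, before the corpora are loaded, should be enough. Each corpus is passed to the download function on its own.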