[MODULE] - Spelling Check

Open divyanshukatiyar opened this issue 3 years ago • 12 comments

Description
This module checks a given string for misspelled words. It uses NLTK, a suite of libraries and programs for statistical NLP. From nltk, we import the words corpus, which contains roughly 236,000 word entries, and the brown corpus. We create a union of the two corpora and match every word of the text against this union. Unmatched words are assumed to be incorrect, and the module reports the number of incorrect words in the text.

Implementation

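import nltk

# Download each corpus separately: the second positional argument of
# nltk.download() is the download directory, not another corpus name.
nltk.download('words')
nltk.download('brown')
from nltk.corpus import words, brown

# Lowercase both corpora and keep them in a set, so that lookups are fast
# and match the lowercased input text.
correct_words = {w.lower() for w in words.words()} | {w.lower() for w in brown.words()}
text = "About today, somthing was not rigt."
text_list = text.replace(',', '').replace('.', '').lower().split()

# Words found in neither corpus are treated as misspelled.
misspelled = [word for word in text_list if word not in correct_words]

print({
    "spellingErrors": len(misspelled)
})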

Input
"About today, somthing was not rigt."

Output
{"spellingErrors": 2}

Additional information
The module requires both corpora to work. On a local machine they might not be pre-installed, but they can be downloaded by running nltk.download('words') and nltk.download('brown') before importing them.

divyanshukatiyar avatar Nov 09 '22 15:11 divyanshukatiyar
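
The implementation above strips only commas and periods before splitting, so a token such as "rigt!" would keep its punctuation and be counted as an error for the wrong reason. A minimal sketch of a regex-based tokenizer follows; the function name `tokenize` and the choice to keep simple apostrophe contractions are assumptions for illustration, not part of the proposed module.

```python
import re


def tokenize(text):
    """Return lowercase word tokens, dropping all surrounding punctuation.

    Hypothetical helper: keeps simple contractions such as "don't" together.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


print(tokenize("About today, somthing was not rigt!"))
# ['about', 'today', 'somthing', 'was', 'not', 'rigt']
```

Whether contractions should count as single words depends on what the corpora contain, so that choice would need checking against the real word lists.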

This uses a server under the hood, right? Might be difficult for setup in refinery, but let's give it a try :)

jhoetter avatar Nov 09 '22 15:11 jhoetter

This uses a server under the hood, right? Might be difficult for setup in refinery, but let's give it a try :)

Yes let's try it out. If it doesn't work, we can look for alternatives.

divyanshukatiyar avatar Nov 09 '22 16:11 divyanshukatiyar

Since the language-tool based approach caused some issues, how about using https://norvig.com/spell-correct.html?

jhoetter avatar Nov 10 '22 14:11 jhoetter

@jhoetter THAT IS SO COOL, how did you find this?

LeonardPuettmann avatar Nov 10 '22 14:11 LeonardPuettmann

Was part of some AI course I did a few years ago :)

jhoetter avatar Nov 14 '22 10:11 jhoetter

I think @divyanshukatiyar is making some changes to that module, so I'll reopen it.

jhoetter avatar Nov 14 '22 11:11 jhoetter

Alright, so I made some changes to the module. To keep it among the classifiers, this module now just tells you how many spelling errors there are in the text string.

divyanshukatiyar avatar Nov 15 '22 12:11 divyanshukatiyar

On the live version, this module's endpoint is not working. To be usable in a production environment, we also need to think about ways to make this module a bit quicker. Putting this to draft, let's have a look at this together next week!

LeonardPuettmann avatar Jan 06 '23 09:01 LeonardPuettmann

I have rewritten the module to use the TextBlob library, which implements the method proposed by Peter Norvig. The approach is not perfect, but I think it's alright for a fast and free service. See PR #247

LeonardPuettmann avatar Feb 20 '23 09:02 LeonardPuettmann
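
For readers unfamiliar with the method from the linked Norvig article, here is a simplified sketch of the idea: generate every string one edit away from a word and pick the most frequent known candidate. It is limited to a single edit and uses a toy frequency table (`WORD_COUNTS` is made up for illustration); it is not TextBlob's actual code, which may differ in its details.

```python
import string
from collections import Counter

# Toy word frequencies (an assumption for this sketch). A real module would
# count words from a large text corpus instead.
WORD_COUNTS = Counter(
    {"about": 4, "today": 3, "something": 5, "was": 9, "not": 9, "right": 8}
)


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(candidates):
    """Keep only the candidates that appear in the frequency table."""
    return {w for w in candidates if w in WORD_COUNTS}


def correction(word):
    """Most frequent known candidate, falling back to the word itself."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correction("somthing"))  # something
print(correction("rigt"))      # right
```

A word counts as a spelling error in this scheme when `correction(word)` differs from the word itself, which is how a count of errors could be derived from it.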

Now that the corpus is loaded into a set, the module is much faster. Closing now.

LeonardPuettmannKern avatar Apr 25 '23 08:04 LeonardPuettmannKern
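
The speed-up from a set comes from membership tests: `word in some_list` scans the list element by element, while `word in some_set` is a hash lookup. A minimal sketch with toy word lists standing in for the real corpora follows; the helper names `build_vocabulary` and `count_spelling_errors` are assumptions for illustration.

```python
def build_vocabulary(*word_lists):
    """Merge several word lists into one lowercase frozenset (built once)."""
    return frozenset(w.lower() for words in word_lists for w in words)


def count_spelling_errors(tokens, vocabulary):
    """Count tokens that are missing from the vocabulary."""
    return sum(1 for token in tokens if token not in vocabulary)


# Toy stand-ins for the two corpora; the real module would pass the full lists.
toy_words = ["About", "today", "something"]
toy_brown = ["was", "not", "right", "today"]
VOCABULARY = build_vocabulary(toy_words, toy_brown)

tokens = ["about", "today", "somthing", "was", "not", "rigt"]
print({"spellingErrors": count_spelling_errors(tokens, VOCABULARY)})
# {'spellingErrors': 2}
```

Building the set once at module load and reusing it across requests avoids paying the construction cost on every call.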

I found a small issue: in FastAPI, it produces an Internal Server Error. Also, the following lines are missing in common: import nltk, nltk.download('words') and nltk.download('brown').

It works fine in refinery.

SvenjaKern avatar Jun 23 '23 11:06 SvenjaKern

Yes, that's right. The code "assumes" that both corpora are already downloaded. We should add a check that tests whether that's the case and downloads them if not. Good catch!

LeonardPuettmannKern avatar Jun 23 '23 11:06 LeonardPuettmannKern
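
One way such a check could look is sketched below. It relies on `nltk.data.find` raising `LookupError` when a resource is missing. The helper name `ensure_corpora` and its injectable `find` and `download` parameters are assumptions made so that the logic can be exercised without network access; this is not the code that was merged.

```python
def ensure_corpora(names, find=None, download=None):
    """Download every corpus in `names` that is not available locally.

    Returns the list of corpus names that had to be downloaded.
    `find` and `download` default to nltk.data.find and nltk.download.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with stand-ins

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name in names:
        try:
            find(f"corpora/{name}")  # raises LookupError if the corpus is missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# Intended use at module import time (needs nltk and, on first run, network access):
# ensure_corpora(["words", "brown"])
```

Calling the helper before `from nltk.corpus import words, brown` would keep the download out of the request path on every run after the first.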