
[MODULE] - Smalltalk Truncation

Open divyanshukatiyar opened this issue 3 years ago • 2 comments

Description This module removes irrelevant text (small talk) from a passage or chat and returns the information that is likely to be relevant.

Implementation

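import re
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

# A set gives fast membership tests.
sw = set(stopwords.words('english'))
text = '''"Hello, Diana".
"Hello, Phill. how are you doing?"
"I am doing fine, and you?".
"I am doing good as well!".
"Listen, I wanted to talk to you about the divorce papers our lawyer set up. Basically, he needs our signatures on the documents set up.".
"The document says that you accept to pay me a monthly alimony of 2000 dollars and that I will have the custody of our child.".
"Have a nice day."'''
# Non-greedy match of each double-quoted message; the group drops the quotes.
regex = re.compile(r'"(.*?)"')

for message in regex.findall(text):
    chat = message.split()
    new_text = []    # tokens that are not stop words
    stop_words = []  # tokens that are stop words
    for token in chat:
        # NLTK's stop-word list is lowercase, so normalize case and strip
        # surrounding punctuation before the lookup ("I", "you?" now match).
        if token.lower().strip('.,!?') not in sw:
            new_text.append(token)
        else:
            stop_words.append(token)
    # Keep messages of at least 3 tokens that are mostly content words
    # or that contain more than 8 stop words (i.e. long messages).
    if len(chat) >= 3 and (len(new_text) > 0.5 * len(chat) or len(stop_words) > 8):
        print(" ".join(chat))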
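
To try the heuristic without downloading NLTK data, the filter can be sketched as a standalone function. This is only an illustration: `is_relevant`, `STOP_WORDS`, and the tiny stop-word set are hypothetical stand-ins that are not part of the module, and the thresholds are simply copied from the snippet above.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the module itself uses NLTK's English stop-word list.
STOP_WORDS = {"i", "am", "and", "you", "a", "the", "to", "of", "that", "have",
              "how", "are", "as", "our", "he", "on", "me", "will", "about"}


def is_relevant(message, stop_words=STOP_WORDS, min_tokens=3, max_stop=8):
    """Return True if the message looks like more than small talk."""
    tokens = message.split()
    if len(tokens) < min_tokens:
        return False
    # Lowercase and strip punctuation so "I" and "you?" count as stop words.
    normalized = [t.lower().strip(string.punctuation) for t in tokens]
    n_stop = sum(1 for t in normalized if t in stop_words)
    n_content = len(tokens) - n_stop
    # Same rule as above: mostly content words, or many stop words.
    return n_content > 0.5 * len(tokens) or n_stop > max_stop


messages = [
    "Hello, Diana",
    "I am doing fine, and you?",
    "The document says that you accept to pay me a monthly alimony of 2000 dollars.",
]
kept = [m for m in messages if is_relevant(m)]
print(kept)
```

With this toy stop-word set only the third message is kept; results with NLTK's full list may differ because it contains many more words.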


divyanshukatiyar • Nov 21 '22 22:11

The module uses a simple regex to extract sentences. NLTK's sentence tokenizer would be a better fit for this.

For example:

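import nltk

# Needs the 'punkt' resource (nltk.download('punkt')); `text` is the chat string from the module above.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sent_detector.tokenize(text.strip())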

Otherwise, sentences like "This is Dr. Smith and he lives in London." would get chopped up.
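
The chopping problem can be reproduced with the standard library alone. The sketch below assumes a naive splitter that breaks after any sentence-ending punctuation, and compares it with a variant that skips a hand-picked abbreviation list. `ABBREVIATIONS` and `split_sentences` are hypothetical names for illustration; a trained tokenizer such as Punkt avoids the need to maintain such a list by hand.

```python
import re

sentence = "This is Dr. Smith and he lives in London. He moved there in 2010."

# Naive splitter: break at whitespace that follows '.', '!' or '?'.
naive = re.split(r"(?<=[.!?])\s+", sentence)
print(naive)

# Hypothetical stopgap: never split right after a known abbreviation.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Prof.")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    # One fixed-width negative lookbehind per abbreviation.
    guards = "".join(f"(?<!{re.escape(a)})" for a in abbreviations)
    return re.split(guards + r"(?<=[.!?])\s+", text)


aware = split_sentences(sentence)
print(aware)
```

The naive version cuts the first sentence in two after "Dr.", while the guarded version keeps it whole. The guard only covers abbreviations that are listed explicitly, which is why a proper sentence tokenizer is the more robust choice.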

LeonardPuettmann • Nov 28 '22 10:11

I found a small issue here: →refinery: change ATTRIBUTE = “text” →localhost: Error: Unprocessable Entity

SvenjaKern • Jul 21 '23 16:07