[MODULE] - Smalltalk Truncation
Description
This module removes all irrelevant text from a passage or chat and returns the potentially relevant information.
Implementation
import re
from nltk.corpus import stopwords
sw = stopwords.words('english')
text = '''"Hello, Diana".
"Hello, Phill. how are you doing?"
"I am doing fine, and you?".
"I am doing good as well!".
"Listen, I wanted to talk to you about the divorce papers our lawyer set up. Basically, he needs our signatures on the documents set up.".
"The document says that you accept to pay me a monthly alimony of 2000 dollars and that I will have the custody of our child.".
"Have a nice day."'''
regex = re.compile(r"\".*?\"")
for message in regex.findall(text):
    chat = message.replace('"', '')
    chat = chat.split()
    new_text = []
    stop_words = []
    for token in chat:
        if token not in sw:
            new_text.append(token)
        else:
            stop_words.append(token)
    # Keep messages that are mostly content words or contain many stopwords (long sentences),
    # and skip messages shorter than three tokens.
    if (len(new_text) > 0.5 * len(chat) or len(stop_words) > 8) and not len(chat) < 3:
        print(" ".join(chat))
Additional context
The module uses a simple regex to extract sentences. It would be better to use NLTK for this, for example:
import nltk
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sent_detector.tokenize(text.strip())
Otherwise, sentences like "This is Dr. Smith and he lives in London." would get chopped up.
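For illustration, a quick check of the abbreviation handling, with a hypothetical second sentence appended to the example above (assumes the punkt model is available, e.g. via nltk.download('punkt')):

import nltk
nltk.download('punkt', quiet=True)  # assumption: the punkt model may need to be fetched first

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sample = "This is Dr. Smith and he lives in London. He works at a hospital."
print(sent_detector.tokenize(sample.strip()))
# Expected: two sentences, with "Dr." not treated as a sentence boundary.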
I found a small issue here: in refinery, changing ATTRIBUTE = "text" results in Error: Unprocessable Entity on localhost.