[MODULE] - Sentence extraction

Open jhoetter opened this issue 3 years ago • 4 comments

Please describe the module you would like to add to the content library: I have one large paragraph that contains multiple sentences, and I want to detect the individual sentences.

Do you already have an implementation? -

Additional context: Use spaCy or something like detectormorse for this.

jhoetter avatar Nov 07 '22 22:11 jhoetter

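It would be possible to use NLTK for this.

import nltk

text = "..."

# Load the pre-trained Punkt sentence tokenizer (requires the "punkt" resource to be downloaded)
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# tokenize() returns a list with one string per detected sentence
sentences = sent_detector.tokenize(text.strip())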

LeonardPuettmann avatar Nov 23 '22 09:11 LeonardPuettmann

I think spaCy offers something similar. Since refinery uses spaCy under the hood, I'd recommend building a spaCy-based sentence tokenizer first :)
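
A minimal sketch of what such a spaCy-based splitter could look like, assuming spaCy v3 and its rule-based `sentencizer` component on a blank English pipeline, so no trained model has to be downloaded. The function name `split_sentences` is hypothetical and not part of refinery or bricks.

```python
from typing import List

import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
# Rule-based sentence boundary detection that splits on sentence-final punctuation.
nlp.add_pipe("sentencizer")


def split_sentences(text: str) -> List[str]:
    """Return the sentences found in `text` (hypothetical helper)."""
    doc = nlp(text)
    # doc.sents yields one Span per detected sentence.
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("This is one. This is two! Is this three?"):
        print(sentence)
```

The rule-based sentencizer is fast but only looks at punctuation, so abbreviations and other complex cases would probably need a trained pipeline with a dependency parser or the `senter` component instead.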

jhoetter avatar Nov 24 '22 12:11 jhoetter

There is also a BERT sentence detector, if I remember correctly: https://huggingface.co/sentence-transformers

SvenjaKernAi avatar May 12 '23 06:05 SvenjaKernAi

In refinery 2.0/cognition, it will be really interesting to detect sentences even when they are rather complex, since this allows us to create better chunks for RAG (embedding lists).
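
To illustrate the chunking idea, here is a rough sketch that greedily packs already-detected sentences into chunks without splitting any sentence. The function name `chunk_sentences` and the `max_chars` limit are assumptions made for illustration; they do not describe how refinery or cognition builds chunks.

```python
from typing import List


def chunk_sentences(sentences: List[str], max_chars: int = 200) -> List[str]:
    """Greedily group sentences into chunks of at most `max_chars` characters.

    Hypothetical helper: a sentence longer than `max_chars` is kept whole
    as its own chunk rather than being cut in the middle.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        # One extra character for the joining space if the chunk is non-empty.
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # The current chunk is full: flush it and start a new one.
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    print(chunk_sentences(["aaaa.", "bbbb.", "cccc."], max_chars=11))
```

Because chunk boundaries always fall on sentence boundaries, each chunk passed to the embedding step contains only complete sentences, which is the benefit of running sentence detection first.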

jhoetter avatar Sep 26 '23 15:09 jhoetter