[MODULE] - Sentence extraction
Please describe the module you would like to add to the content library: I have one large paragraph that contains multiple sentences, and I want to detect the individual sentences.
Do you already have an implementation? -
Additional context: Use spaCy or something like detectormorse for this.
It would be possible to use NLTK for this.
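import nltk
# Requires the Punkt model to be available locally, e.g. via nltk.download('punkt')
text = "..."
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# tokenize() returns a list with one string per detected sentence
sentences = sent_detector.tokenize(text.strip())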
I think spaCy offers something similar. Since refinery uses spaCy under the hood, I'd recommend building a spaCy-based sentence tokenizer first :)
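A minimal sketch of what a spaCy-based version could look like, assuming spaCy v3 and its rule-based `sentencizer` component (which splits on punctuation and needs no model download). The function name `split_sentences` is just a placeholder, not an existing refinery or content-library API.

```python
import spacy

# Blank English pipeline with only the rule-based sentencizer,
# so no trained model has to be downloaded (assumes spaCy v3).
_nlp = spacy.blank("en")
_nlp.add_pipe("sentencizer")


def split_sentences(text: str) -> list[str]:
    """Return the sentences found in `text` (hypothetical helper name)."""
    doc = _nlp(text)
    # doc.sents yields one Span per detected sentence
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("This is one. This is two."):
        print(sentence)
```

For more complex sentences, swapping the blank pipeline for a trained one (e.g. `spacy.load("en_core_web_sm")`) would probably give better boundaries, since it uses the dependency parse instead of punctuation rules.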
There is also a BERT-based sentence detector, if I remember correctly: https://huggingface.co/sentence-transformers
In refinery 2.0/cognition, it will be really interesting to detect sentences even when they are rather complex, since this allows us to create better chunks for RAG (embedding lists).
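To illustrate the chunking idea, here is a rough sketch of how detected sentences could be grouped into chunks for embedding. It is plain Python and independent of the sentence detector used; `chunk_sentences` and the `max_chars` limit are assumptions for illustration, not how refinery or cognition actually build chunks.

```python
def chunk_sentences(sentences: list[str], max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        # Account for the joining space when the chunk already has content
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # Close the current chunk and start a new one with this sentence
            chunks.append(" ".join(current))
            current = [sentence]
            current_len = len(sentence)
        else:
            current.append(sentence)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunk boundaries always fall between sentences, no sentence is cut in half, which is the main reason good sentence detection matters here.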