edsnlp
Problem detecting sentences
Description
The sentences pipe behaves inconsistently: depending on the date format, the same text is split into either one or two sentences.
Example:
text1 = "10.10.2010 : RCP"  ## >> 2 sentences: [10.10.2010 :, RCP]
text2 = "10/10/2010 : RCP"  ## >> 1 sentence
How to reproduce the bug
import edsnlp.pipes as eds
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.dates())
text1 = "10.10.2010 : RCP"  ## >> 2 sentences: [10.10.2010 :, RCP]
text2 = "10/10/2010 : RCP"  ## >> 1 sentence
doc1 = nlp(text1)
doc2 = nlp(text2)
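To make the discrepancy visible, the detected sentences can be listed for each input. This is a minimal sketch that assumes only that the pipeline returns a spaCy-style `Doc` exposing `doc.sents`; `report_sentences` is a hypothetical helper written for this report, not part of edsnlp.

```python
def report_sentences(nlp, texts):
    """Run `nlp` on each text and return {text: [sentence strings]}.

    `nlp` is any callable returning an object with a `.sents` iterable
    of spans that expose `.text` (as a spaCy `Doc` does).
    """
    report = {}
    for text in texts:
        doc = nlp(text)
        # doc.sents yields one span per detected sentence
        report[text] = [sent.text for sent in doc.sents]
    return report


# Usage with the pipeline and texts defined above:
# for text, sents in report_sentences(nlp, [text1, text2]).items():
#     print(repr(text), len(sents), sents)
```

Collecting the sentence texts next to their source string makes it easy to compare the two date formats side by side.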
@percevalw @svittoz
Another example, which is unexpectedly not split into 2 sentences:
import edsnlp.pipes as eds
import edsnlp
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.sentences(check_capitalized=False, min_newline_count=1))
doc = nlp(
"Chimiothérapie par XXX (débutée en DATE et dernière cure C6 le DATE )\n-Radiothérapie antalgique rachis T10 L1 L3 en DATE\n\n"
)
len(list(doc.sents))