Problem detecting sentences

Open aricohen93 opened this issue 1 year ago • 1 comments

Description

The eds.sentences pipe behaves inconsistently: depending on how the date is written, the same text is split into either one or two sentences.

Example:

text1 = "10.10.2010 : RCP" ## >> 2 sentences: [10.10.2010 :, RCP]
text2 = "10/10/2010 : RCP" ## >> 1 sentence

How to reproduce the bug

import edsnlp.pipes as eds
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.dates())


text1 = "10.10.2010 : RCP" ## >> 2 sentences: [10.10.2010 :, RCP]
text2 = "10/10/2010 : RCP" ## >> 1 sentence

doc1 = nlp(text1)
doc2 = nlp(text2)
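
To make the symptom easier to reason about, here is a minimal pure-Python sketch of a rule that would produce the same discrepancy. It is not edsnlp's implementation, and this thread does not identify the actual cause. The sketch assumes a hypothetical rule where any "." sets a pending-boundary flag, and the next token starting with an uppercase letter opens a new sentence. The names `toy_split` and `END_PUNCT` are made up for illustration.

```python
import re

# Assumed set of sentence-ending characters (hypothetical, for illustration only).
END_PUNCT = {".", "!", "?"}


def toy_split(text):
    """Toy sentence splitter; NOT edsnlp's algorithm.

    A token in END_PUNCT sets a pending flag. The next token that starts
    with an uppercase letter then begins a new sentence. Digits and other
    punctuation leave the flag untouched.
    """
    starts = [0]
    seen_period = False
    # Tokens are either runs of word characters or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        token = match.group()
        if token in END_PUNCT:
            seen_period = True
        elif seen_period and token[0].isupper():
            starts.append(match.start())
            seen_period = False
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


print(toy_split("10.10.2010 : RCP"))  # ['10.10.2010 :', 'RCP']
print(toy_split("10/10/2010 : RCP"))  # ['10/10/2010 : RCP']
```

Under that assumed rule, the periods inside the date leave the flag set until "RCP" is reached, while the slash-separated date never sets it. Whether edsnlp follows a comparable path would need to be checked against its source.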

aricohen93 avatar Apr 24 '25 11:04 aricohen93

@percevalw @svittoz

Another example that doesn't split into 2 sentences:

import edsnlp.pipes as eds
import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.normalizer())
nlp.add_pipe(eds.sentences(check_capitalized=False, min_newline_count=1))
doc = nlp(
    "Chimiothérapie par XXX (débutée en DATE et dernière cure C6 le DATE )\n-Radiothérapie antalgique rachis T10 L1 L3 en DATE\n\n"
)
len(list(doc.sents))
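
For reference, the segmentation expected here can be approximated without edsnlp. The sketch below assumes that min_newline_count=1 is meant to turn every single newline into a sentence boundary; that reading is an assumption, and `split_on_newlines` is a hypothetical helper, not part of the library.

```python
import re


def split_on_newlines(text, min_newline_count=1):
    """Split text wherever at least `min_newline_count` newlines occur in a row.

    Hypothetical stand-in for the expected behaviour, not edsnlp code.
    """
    # A boundary is a run of newlines, each optionally preceded by spaces or tabs.
    pattern = r"(?:[ \t]*\n){%d,}" % min_newline_count
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


text = (
    "Chimiothérapie par XXX (débutée en DATE et dernière cure C6 le DATE )\n"
    "-Radiothérapie antalgique rachis T10 L1 L3 en DATE\n\n"
)
for sentence in split_on_newlines(text):
    print(repr(sentence))
```

With this text the helper returns two segments, the second starting at "-Radiothérapie", which is the split the example above was expected to produce.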

aricohen93 avatar May 23 '25 14:05 aricohen93