stdlib
[BUG]: nlp-sentencize wrongly breaks sentences in quotation marks
Description
As the title says.
Here are some quick examples:
console.log(sentencize('I said "Look out" right before he banged his head'));
> [ 'I said "Look out" right before he banged his head' ] // This is correct
console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head'] // This should be one sentence
From looking at the code, it seems to be doing exactly what it is told, but the result doesn't seem quite right.
If the token is a suffix (i.e., a closing ") and the previous token is a punctuation mark (., !, or ?), then split.
Related Issues
#3013
Questions
No.
Demo
No response
Reproduction
console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head']
Expected Results
['I said "Look out!" right before he banged his head']
Actual Results
['I said "Look out!"', 'right before he banged his head']
Version
0.2.2
Environments
Node.js
Browser Version
No response
Node.js / npm Version
v22.9.0
Platform
Windows 11
Checklist
- [x] Read and understood the Code of Conduct.
- [x] Searched for existing issues and pull requests.
The tool is likely splitting on punctuation marks: it seems to treat any ., !, or ? as the end of a sentence, which is not true when the mark sits inside a quotation.
The logic could be updated to check whether the punctuation mark (!, ., ?) is within a quotation before splitting.
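A minimal sketch of that idea follows. It assumes a plain character scan rather than the package's actual tokenizer, and `quoteAwareSentencize` is a hypothetical name, not part of stdlib. It only handles straight double quotes, and it is meant to illustrate the check, not to serve as a patch.

```javascript
// Hypothetical illustration only; this is NOT the stdlib implementation.
function quoteAwareSentencize( str ) {
    var inQuote = false; // assumption: quotes are straight double quotes and are balanced
    var start = 0;
    var out = [];
    var next;
    var tail;
    var ch;
    var i;

    for ( i = 0; i < str.length; i++ ) {
        ch = str[ i ];
        if ( ch === '"' ) {
            // Entering or leaving a quotation:
            inQuote = !inQuote;
            continue;
        }
        next = str[ i+1 ];
        // Split only on terminal punctuation that is OUTSIDE a quotation
        // and is followed by whitespace or the end of the string:
        if (
            !inQuote &&
            /[.!?]/.test( ch ) &&
            ( next === void 0 || /\s/.test( next ) )
        ) {
            out.push( str.slice( start, i+1 ).trim() );
            start = i + 1;
        }
    }
    // Keep any trailing text that did not end in terminal punctuation:
    tail = str.slice( start ).trim();
    if ( tail.length > 0 ) {
        out.push( tail );
    }
    return out;
}

console.log( quoteAwareSentencize( 'I said "Look out!" right before he banged his head' ) );
console.log( quoteAwareSentencize( 'He left. She stayed.' ) );
```

With this check, the example from the report stays in one piece, while ordinary sentences still split. One known limitation of the sketch: a quotation that really does end a sentence, as in 'He said "Stop!" Then he left.', is no longer split after the closing quote, so a real fix would probably also need to look at the token that follows the quote (for example, whether it starts with an uppercase letter).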