[BUG]: nlp-sentencize wrongly splits sentences with multiple punctuation marks
Description
Hello! I'm not sure if this is the right place, but I can't post in the other repo.
Using @stdlib/nlp-sentencize@0.2.2 with phrases like 'HAPPY BIRTHDAY!!!' will incorrectly return a sentence for every punctuation mark:
console.log(sentencize('HAPPY BIRTHDAY!!!'));
> ['HAPPY BIRTHDAY!', '!', '!']
console.log(sentencize('what??'));
> ['what?', '?']
console.log(sentencize('HOW DARE YOU?!?!'));
> ['HOW DARE YOU?', '!', '?', '!']
Each of the above examples should be considered a single sentence.
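Until the splitting itself is fixed, the stray fragments can be folded back into the preceding sentence in a post-processing step. The following is only a minimal sketch of such a workaround: `mergePunctuationFragments` is a hypothetical helper, not part of @stdlib/nlp-sentencize, and it assumes that a fragment made up solely of `!`, `?`, or `.` always belongs to the sentence before it.

```javascript
// Hypothetical workaround helper; NOT part of @stdlib/nlp-sentencize.
// Assumption: a fragment consisting only of '!', '?' or '.' characters
// is trailing punctuation that belongs to the previous sentence.
function mergePunctuationFragments(sentences) {
  const merged = [];
  for (const sentence of sentences) {
    const isPunctuationOnly = /^[!?.]+$/.test(sentence);
    if (isPunctuationOnly && merged.length > 0) {
      // Append the stray punctuation to the previous sentence.
      merged[merged.length - 1] += sentence;
    } else {
      merged.push(sentence);
    }
  }
  return merged;
}

// Applied to the output reported above:
console.log(mergePunctuationFragments(['HAPPY BIRTHDAY!', '!', '!']));
console.log(mergePunctuationFragments(['HOW DARE YOU?', '!', '?', '!']));
```

This only patches the symptom on the caller's side. It does nothing for input that the tokenizer already handles correctly, and it cannot recover any whitespace that was dropped between fragments.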
Weirdly enough it works well with ellipsis and phrases ending in !!!1!!11!!! and stuff like that. Such as:
console.log(sentencize('Yeah, about that...'));
> ['Yeah, about that...']
console.log(sentencize('OH EM GEE!!!1!!11!one!!1'));
> ['OH EM GEE!!!1!!11!one!!1']
This one is fine.
Cheers!
Related Issues
No response
Questions
No response
Demo
No response
Reproduction
const sentencize = require('@stdlib/nlp-sentencize');
console.log(sentencize('SURPRISE!!!'));
Expected Results
['SURPRISE!!!']
Actual Results
['SURPRISE!', '!', '!']
Version
0.2.2
Environments
Node.js
Browser Version
No response
Node.js / npm Version
v22.9.0
Platform
Windows 11
Checklist
- [x] Read and understood the Code of Conduct.
- [x] Searched for existing issues and pull requests.
:wave: Hi there! :wave:
And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:
Punctuation is broken with prefixes/suffixes as well. I can make a new issue if need be.
console.log(sentencize('I said "Look out" right before he banged his head'));
> [ 'I said "Look out" right before he banged his head' ] // This is correct
console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head'] // This should be one sentence
@Pupix Thanks for flagging these issues! A separate issue would be a good idea for that. I will be looking into these shortly.
Hello, I've started working on this bug. I've made some progress, but I am still testing some edge cases.
Hey @Planeshifter, I have created a PR for this issue, but nlp-sentencize has its own repo, which does not accept PRs. Could you please help me understand where I should open this PR?
I would also like to submit a PR for this issue. Please tell me how to submit it.
For anyone hoping to work on this issue, note that, IMO, the only appropriate way to provide a robust implementation is by implementing the Unicode sentence segmentation algorithm.
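For comparison while working on this, JavaScript runtimes that ship `Intl.Segmenter` already expose sentence segmentation based on the Unicode text segmentation rules. The sketch below is not the stdlib implementation and is not a proposed patch. It only shows what a rule-based segmenter returns for the inputs in this thread, so a new implementation has something to compare against. `segmentSentences` is a hypothetical helper name, and the `'en'` default locale is an assumption.

```javascript
// Reference sketch using the built-in Intl.Segmenter (available in recent
// Node.js versions and browsers). NOT the @stdlib/nlp-sentencize algorithm.
// `segmentSentences` is a hypothetical helper name used for illustration.
function segmentSentences(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  // Each item yielded by segment() has a `segment` string property.
  // Segments keep their trailing whitespace, so joining them
  // reproduces the input exactly.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

console.log(segmentSentences('SURPRISE!!!'));
console.log(segmentSentences('HOW DARE YOU?!?! Yeah, about that... Fine.'));
```

Note that the segments are not trimmed, so callers comparing against the current `sentencize` output would need to call `trim()` on each one. This sketch also makes no claim about the quoted case reported above ('I said "Look out!" right before ...'). That input should be checked separately against whichever algorithm is chosen.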