stdlib [BUG]: nlp-sentencize wrongly splits sentences with multiple punctuation marks

Description

Hello! Not sure if this is the right place, but can't post in the other repo.

Using @stdlib/nlp-sentencize@0.2.2 with phrases like 'HAPPY BIRTHDAY!!!' incorrectly returns a separate sentence for every punctuation mark:

console.log(sentencize('HAPPY BIRTHDAY!!!'));
> ['HAPPY BIRTHDAY!', '!', '!']

console.log(sentencize('what??'));
> ['what?', '?']

console.log(sentencize('HOW DARE YOU?!?!'));
> ['HOW DARE YOU?', '!', '?', '!']

Each of the above examples should be treated as a single sentence.

Weirdly enough, it works well with ellipses and with phrases ending in !!!1!!11!!! and the like, such as:


console.log(sentencize('Yeah, about that...'));
> ['Yeah, about that...']

console.log(sentencize('OH EM GEE!!!1!!11!one!!1'));
> ['OH EM GEE!!!1!!11!one!!1']

This one is fine.

Cheers!

Reproduction

const sentencize = require('@stdlib/nlp-sentencize');
console.log(sentencize('SURPRISE!!!'));

Expected Results

['SURPRISE!!!']

Actual Results

['SURPRISE!', '!', '!']

Version

0.2.2

Environments

Node.js

Node.js / npm Version

v22.9.0

Platform

Windows 11

Checklist

[x] Read and understood the Code of Conduct.
[x] Searched for existing issues and pull requests.

Oct 16 '24 21:10 Pupix

:wave: Hi there! :wave:

And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:

Oct 16 '24 21:10 stdlib-bot

Punctuation is broken with prefixes/suffixes as well. I can make a new issue if need be.

console.log(sentencize('I said "Look out" right before he banged his head'));
> [ 'I said "Look out" right before he banged his head' ] // This is correct

console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head'] // This should be one sentence

Oct 17 '24 19:10 Pupix

@Pupix Thanks for flagging these issues! A separate issue would be a good idea for that. I will be looking into these shortly.

Oct 17 '24 20:10 Planeshifter

Hello, I've started working on the bug. I've made some progress, but I am still testing some edge cases.

Nov 04 '24 22:11 andrejTodoroski21

Hey @Planeshifter, I have created a PR for this issue, but nlp-sentencize has its own repo, which is not accepting PRs. Could you please help me understand where I should submit this PR?

Nov 10 '24 15:11 toffee-k21

I would also like to submit a PR for this issue. Could you please tell me how to submit it?

Nov 10 '24 15:11 MVARUNREDDY8203

For anyone hoping to work on this issue, note that, IMO, the only appropriate way to provide a robust implementation is by implementing the Unicode sentence segmentation algorithm.

Nov 13 '24 00:11 kgryte
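
For reference, here is a minimal sketch of what Unicode-style sentence segmentation looks like in practice. It uses the built-in Intl.Segmenter rather than @stdlib/nlp-sentencize, so it is an illustration and not the stdlib fix. The helper name splitSentences is hypothetical. Exact boundaries are implementation-defined; in ICU-backed engines such as Node.js they generally follow the Unicode Text Segmentation rules (UAX #29), under which a run of terminators like '!!!' or '?!?!' should stay attached to the preceding sentence.

```javascript
// Sketch only: relies on the built-in Intl.Segmenter (available in Node.js 16+),
// NOT on @stdlib/nlp-sentencize. `splitSentences` is a hypothetical helper name.
function splitSentences(text, locale = 'en') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
    const sentences = [];
    for (const { segment } of segmenter.segment(text)) {
        // Segments include trailing whitespace, so trim and skip empty pieces.
        const trimmed = segment.trim();
        if (trimmed.length > 0) {
            sentences.push(trimmed);
        }
    }
    return sentences;
}

// Inputs taken from the report above, plus one two-sentence control case.
console.log(splitSentences('SURPRISE!!!'));
console.log(splitSentences('HOW DARE YOU?!?!'));
console.log(splitSentences('Hello world! How are you?'));
```

Comparing this output against sentencize for the same inputs may be a quick way to build regression cases for whichever fix lands.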