
[BUG]: nlp-sentencize wrongly splits sentences with multiple punctuation marks

Open Pupix opened this issue 1 year ago • 3 comments

Description

Hello! I'm not sure if this is the right place, but I can't post in the other repo.

Using @stdlib/nlp-sentencize@0.2.2 with phrases like 'HAPPY BIRTHDAY!!!' incorrectly returns a separate sentence for every punctuation mark:

console.log(sentencize('HAPPY BIRTHDAY!!!'));
> ['HAPPY BIRTHDAY!', '!', '!']

console.log(sentencize('what??'));
>  ['what?', '?']

console.log(sentencize('HOW DARE YOU?!?!'));
> ['HOW DARE YOU?', '!', '?', '!']

Each of the above examples should be treated as a single sentence.

Weirdly enough, it works well with ellipses and with phrases ending in runs like !!!1!!11!!!, such as:


console.log(sentencize('Yeah, about that...'));
> ['Yeah, about that...']

console.log(sentencize('OH EM GEE!!!1!!11!one!!1'));
> ['OH EM GEE!!!1!!11!one!!1']

Both of these are fine.

Cheers!

Related Issues

No response

Questions

No response

Demo

No response

Reproduction

const sentencize = require('@stdlib/nlp-sentencize');
console.log(sentencize('SURPRISE!!!'));

Expected Results

['SURPRISE!!!']

Actual Results

['SURPRISE!', '!', '!']
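
Until the splitting itself is fixed, one possible stopgap is to post-process the returned array and re-attach any fragment that consists solely of terminal punctuation to the sentence before it. This is only a sketch: `mergePunctuationFragments` is a hypothetical helper, not part of @stdlib, and it assumes the stray fragments contain nothing but `!`, `?`, or `.` characters.

```javascript
// Hypothetical helper (NOT part of @stdlib): merges fragments made up only of
// terminal punctuation back into the preceding sentence.
function mergePunctuationFragments( sentences ) {
    const out = [];
    for ( const s of sentences ) {
        const fragment = s.trim();
        // A fragment such as '!' or '?!' is treated as a continuation of the
        // previous sentence rather than as a sentence of its own.
        if ( out.length > 0 && /^[!?.]+$/.test( fragment ) ) {
            out[ out.length - 1 ] += fragment;
        } else {
            out.push( s );
        }
    }
    return out;
}

console.log( mergePunctuationFragments( [ 'SURPRISE!', '!', '!' ] ) );
// => [ 'SURPRISE!!!' ]
```

This cannot tell a wrongly split '!' apart from a genuinely standalone '...' in the input, and it does not help with the quoted-punctuation case, so it only masks the symptom reported here.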

Version

0.2.2

Environments

Node.js

Browser Version

No response

Node.js / npm Version

v22.9.0

Platform

Windows 11

Checklist

  • [x] Read and understood the Code of Conduct.
  • [x] Searched for existing issues and pull requests.

Pupix avatar Oct 16 '24 21:10 Pupix

:wave: Hi there! :wave:

And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:

stdlib-bot avatar Oct 16 '24 21:10 stdlib-bot

Punctuation is broken with prefixes/suffixes as well. I can make a new issue if need be.

console.log(sentencize('I said "Look out" right before he banged his head'));
> [ 'I said "Look out" right before he banged his head' ] // This is correct

console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head'] // This should be one sentence

Pupix avatar Oct 17 '24 19:10 Pupix

@Pupix Thanks for flagging these issues! A separate issue would be a good idea for that. I will be looking into these shortly.

Planeshifter avatar Oct 17 '24 20:10 Planeshifter

Hello, I've started working on this bug. I've made some progress, but I'm still testing some edge cases.

andrejTodoroski21 avatar Nov 04 '24 22:11 andrejTodoroski21

Hey @Planeshifter, I have created a PR for this issue, but nlp-sentencize has its own repo, which does not accept PRs. Could you please help me understand where I should open this PR?

toffee-k21 avatar Nov 10 '24 15:11 toffee-k21

I would also like to submit a PR for this issue. Could you please tell me how to submit it?

MVARUNREDDY8203 avatar Nov 10 '24 15:11 MVARUNREDDY8203

For anyone hoping to work on this issue, note that, IMO, the only appropriate way to provide a robust implementation is by implementing the Unicode sentence segmentation algorithm.
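
For reference, the Unicode sentence segmentation rules are specified in Unicode Standard Annex #29 (Unicode Text Segmentation), and JavaScript runtimes that ship `Intl.Segmenter` expose an implementation of them. The sketch below is not a proposed stdlib implementation; it only illustrates how those rules treat the inputs from this issue. The name `segmentSentences` is hypothetical, and the code assumes a runtime with `Intl.Segmenter` support (for example, a recent Node.js built with full ICU).

```javascript
// Hypothetical wrapper around Intl.Segmenter (NOT part of @stdlib).
// Assumes the runtime provides Intl.Segmenter with sentence granularity.
function segmentSentences( text, locale ) {
    const segmenter = new Intl.Segmenter( locale || 'en', { granularity: 'sentence' } );
    const out = [];
    for ( const { segment } of segmenter.segment( text ) ) {
        // Segments keep their trailing whitespace; trim it and skip empty pieces.
        const s = segment.trim();
        if ( s.length > 0 ) {
            out.push( s );
        }
    }
    return out;
}

console.log( segmentSentences( 'HAPPY BIRTHDAY!!!' ) );
// => [ 'HAPPY BIRTHDAY!!!' ]

console.log( segmentSentences( 'what?? Really.' ) );
// => [ 'what??', 'Really.' ]
```

Under these rules, a run of consecutive terminators stays attached to its sentence. The quoted case reported above ('I said "Look out!" right before ...') is not necessarily kept together by the default rules, so an implementation may still need tailoring beyond the base algorithm.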

kgryte avatar Nov 13 '24 00:11 kgryte