[BUG]: nlp-sentencize wrongly breaks sentences in quotation marks

Open Pupix opened this issue 1 year ago • 1 comments

Description

As the title says.

Here are some quick examples

console.log(sentencize('I said "Look out" right before he banged his head'));
> [ 'I said "Look out" right before he banged his head' ] // This is correct

console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head'] // This should be one sentence

From looking at the code, it seems to be doing exactly what it's told, but the result doesn't seem quite right. The rule appears to be: if the token is a suffix (i.e., a closing ") and the previous token is a punctuation mark (., !, or ?), then split.

Related Issues

#3013

Questions

No.

Demo

No response

Reproduction

console.log(sentencize('I said "Look out!" right before he banged his head'));
> ['I said "Look out!"', 'right before he banged his head']

Expected Results

['I said "Look out!" right before he banged his head']

Actual Results

['I said "Look out!"', 'right before he banged his head']

Version

0.2.2

Environments

Node.js

Browser Version

No response

Node.js / npm Version

v22.9.0

Platform

Windows 11

Checklist

  • [x] Read and understood the Code of Conduct.
  • [x] Searched for existing issues and pull requests.

Pupix avatar Oct 18 '24 00:10 Pupix

The tool is likely splitting on punctuation marks. It seems to treat any of these marks as the end of a sentence, which isn't true in cases like this one.

The logic could be updated to check if the punctuation mark (!, ., ?) is within a quotation.
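
As a rough illustration of that idea, here is a minimal standalone sketch. It is not the nlp-sentencize implementation and does not reuse its tokenizer. The function name splitSentences and the lowercase-continuation heuristic are assumptions made for this example: when a terminator is immediately followed by a closing quotation mark, the sketch only splits if the next word does not start with a lowercase letter.

```javascript
// Hypothetical helper for illustration only; NOT the library's code.
function splitSentences( text ) {
	const out = [];
	let start = 0;
	let i = 0;
	while ( i < text.length ) {
		if ( !'.!?'.includes( text[ i ] ) ) {
			i += 1;
			continue;
		}
		// Consume a run of terminators (e.g., "?!" or "..."):
		let end = i + 1;
		while ( end < text.length && '.!?'.includes( text[ end ] ) ) {
			end += 1;
		}
		// Consume any closing quotation marks directly after the terminator:
		let quoted = false;
		while ( end < text.length && '"\u201D'.includes( text[ end ] ) ) {
			quoted = true;
			end += 1;
		}
		// Locate the next non-whitespace character:
		let next = end;
		while ( next < text.length && /\s/.test( text[ next ] ) ) {
			next += 1;
		}
		const atEnd = ( next === text.length );
		const hasGap = ( next > end );
		const lower = !atEnd && /[a-z]/.test( text[ next ] );

		// Assumed heuristic: a quoted terminator followed by a lowercase
		// word means the outer sentence is still going, so do not split.
		if ( atEnd || ( hasGap && !( quoted && lower ) ) ) {
			out.push( text.slice( start, end ) );
			start = next;
		}
		i = end;
	}
	if ( start < text.length ) {
		out.push( text.slice( start ) );
	}
	return out;
}

console.log( splitSentences( 'I said "Look out!" right before he banged his head' ) );
// => [ 'I said "Look out!" right before he banged his head' ]

console.log( splitSentences( 'He yelled "Stop!" Then he ran.' ) );
// => [ 'He yelled "Stop!"', 'Then he ran.' ]
```

This heuristic is deliberately simple. It would still split when the text after the quotation starts with a capitalized word that continues the sentence (for example, a proper noun), so a real fix would probably need more context than the next character.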

Srayash avatar Oct 18 '24 22:10 Srayash