natural icon indicating copy to clipboard operation
natural copied to clipboard

WordPunctTokenizer incorrectly handles parenthetical expresssions

Open mbc1990 opened this issue 11 years ago • 5 comments

Not sure if this is intentional or not:

var tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("Example sentence (with parenthetical expression)."));

outputs:

 [ 'Example',
  'sentence',
  ' (',
  'with',
  'parenthetical',
  'expression',
  ').' ]

The left paren has a space in front of it, and the right paren and period are a single token instead of two separate tokens. If someone can clarify that this is not right, I'll try to fix it.

mbc1990 avatar Sep 23 '14 15:09 mbc1990

certainly looks weird to me, i would say the correct answer would probably be

[ 'Example',
  'sentence',
  '(',
  'with',
  'parenthetical',
  'expression',
  ')',
  '.' ]

kkoch986 avatar Sep 23 '14 15:09 kkoch986

  • The RegExpTokenizer code looks a bit funky! Such as, the RegExp constructor takes a RegExp literal (L81);
  • (why does it have a i flag?)
  • /(\w+|[\s\S])/g seems to work for me, it produces the output as specified by @kkoch986. I’ve only tested it in the console tho.
  • However, L77 specifies how ). should, according to Natural, not be split.

wooorm avatar Sep 23 '14 16:09 wooorm

@wooorm you're right, thats what i get for not actually looking at the code before replying.

I would maintain that the space should probably be removed, but it seems from looking that regex used that grouping consecutive punctuation is intentional. Its purpose is to essentially group the string into consecutive alphanumeric and consecutive non-alphanumeric groups.

@mbc1990 I think what you might be looking for is the regex given by @wooorm above, something like this:

var WordTokenizer = function(options) {
    this._pattern = new RegExp(/(\w+|[\s\S])/g);
    RegexpTokenizer.call(this,options)
};

util.inherits(WordTokenizer, RegexpTokenizer);
exports.WordTokenizer = WordTokenizer;

Also for the record, i agree the /i is no needed as it is covered by \w+ which is really saying [A-Za-z0-9_]+

kkoch986 avatar Sep 23 '14 16:09 kkoch986

@mbc1990 have you tried passing {gaps: false} to WordPunctTokenizer? That might help?

@kkoch986 new RegExp(/(\w+|[\s\S])/g), should really be new /(\w+|[\s\S])/g (the same goes for the code in the example (although it doesn’t really hurt))

wooorm avatar Sep 23 '14 16:09 wooorm

Sorry guys, that was my RegEx way back in the day. The code has already been changed from what is listed above. Im not sure if this is still an issue or not.

silentrob avatar Mar 28 '15 17:03 silentrob