natural WordPunctTokenizer incorrectly handles parenthetical expresssions

Not sure if this is intentional or not:

var tokenizer = new natural.WordPunctTokenizer();
console.log(tokenizer.tokenize("Example sentence (with parenthetical expression)."));

outputs:

 [ 'Example',
  'sentence',
  ' (',
  'with',
  'parenthetical',
  'expression',
  ').' ]

The left paren has a space in front of it, and the right paren and period are a single token instead of two separate tokens. If someone can clarify that this is not right, I'll try to fix it.

Sep 23 '14 15:09 mbc1990

certainly looks weird to me, i would say the correct answer would probably be

[ 'Example',
  'sentence',
  '(',
  'with',
  'parenthetical',
  'expression',
  ')',
  '.' ]

Sep 23 '14 15:09 kkoch986

The RegExpTokenizer code looks a bit funky! Such as, the RegExp constructor takes a RegExp literal (L81);
(why does it have a i flag?)
/(\w+|[\s\S])/g seems to work for me, it produces the output as specified by @kkoch986. I’ve only tested it in the console tho.
However, L77 specifies how ). should, according to Natural, not be split.

Sep 23 '14 16:09 wooorm

@wooorm you're right, thats what i get for not actually looking at the code before replying.

I would maintain that the space should probably be removed, but it seems from looking that regex used that grouping consecutive punctuation is intentional. Its purpose is to essentially group the string into consecutive alphanumeric and consecutive non-alphanumeric groups.

@mbc1990 I think what you might be looking for is the regex given by @wooorm above, something like this:

var WordTokenizer = function(options) {
    this._pattern = new RegExp(/(\w+|[\s\S])/g);
    RegexpTokenizer.call(this,options)
};

util.inherits(WordTokenizer, RegexpTokenizer);
exports.WordTokenizer = WordTokenizer;

Also for the record, i agree the /i is no needed as it is covered by \w+ which is really saying [A-Za-z0-9_]+

Sep 23 '14 16:09 kkoch986

@mbc1990 have you tried passing {gaps: false} to WordPunctTokenizer? That might help?

@kkoch986 new RegExp(/(\w+|[\s\S])/g), should really be new /(\w+|[\s\S])/g (the same goes for the code in the example (although it doesn’t really hurt))

Sep 23 '14 16:09 wooorm

Sorry guys, that was my RegEx way back in the day. The code has already been changed from what is listed above. Im not sure if this is still an issue or not.

Mar 28 '15 17:03 silentrob