Corpus#removeWords is not working properly with unicode characters

Open namirsab opened this issue 9 years ago • 0 comments

Observed

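If a document contains a word such as zurück and the list of words to remove is ['zur'], this step removes zur from inside the word, turning zurück into ück. This happens because the function uses word boundaries (\b), which in JavaScript regular expressions are defined only in terms of ASCII word characters (letters, digits, and underscore). The ü therefore counts as a non-word character, and a boundary is matched between r and ü.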
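The behaviour can be reproduced with a plain regular expression, independent of the library. This is a minimal sketch that assumes the removal is built roughly as \b + word + \b; the variable names are illustrative and are not taken from text-miner's source.

```javascript
// Standalone reproduction; this does not call text-miner itself.
// Assumption: the word-removal pattern is built as \b + word + \b.
const text = 'zurück zur Arbeit';
const wordToRemove = 'zur';

// \b only treats [A-Za-z0-9_] as word characters, so 'ü' is a non-word
// character and a boundary is matched between 'r' and 'ü'.
const asciiBoundary = new RegExp('\\b' + wordToRemove + '\\b', 'g');

const result = text.replace(asciiBoundary, '');
console.log(JSON.stringify(result)); // "ück  Arbeit"
```

Both the standalone zur and the prefix of zurück are removed, which matches the observation above.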

Expected

  • [ ] The function uses a Unicode-compatible regexp.
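One possible shape for such a regexp, offered as a sketch and not as the library's actual fix: replace \b with lookarounds built on Unicode property escapes, so that a match is rejected when it is adjacent to any letter, combining mark, digit, or underscore. The helper names escapeRegExp and removeWholeWords are hypothetical, and the sketch assumes a runtime with ES2018 support for the u flag, \p{…}, and lookbehind.

```javascript
// Hypothetical helpers; not part of text-miner's API.

// Escape regexp metacharacters so each word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Remove whole-word occurrences only. A "word character" here means a
// Unicode letter, combining mark, digit, or underscore.
function removeWholeWords(text, words) {
  if (words.length === 0) {
    return text;
  }
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  const alternation = words.map(escapeRegExp).join('|');
  const pattern = new RegExp(
    '(?<!' + wordChar + ')(?:' + alternation + ')(?!' + wordChar + ')',
    'gu'
  );
  return text.replace(pattern, '');
}

const cleaned = removeWholeWords('zurück zur Arbeit', ['zur']);
console.log(JSON.stringify(cleaned)); // "zurück  Arbeit"
```

With this pattern zurück is left intact while the standalone zur is still removed. Words that begin or end with a non-ASCII letter, such as über, are also matched as whole words, which \b fails to do.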

namirsab, Jan 09 '17 10:01