
About Chinese

ahl5esoft opened this issue 11 years ago • 7 comments

How do I use the classifier with Chinese text?

ahl5esoft avatar Jul 24 '14 08:07 ahl5esoft

```js
classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('优香岛桂皮', '干货');
classifier.addDocument('苗家辣妹辣椒', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.addDocument('三达SD-156A双重过滤烟嘴', '小五金');
classifier.addDocument('波斯BS-I3091测电笔', '小五金');
classifier.train();
```

```js
classifier.classify('紫菜');     // => 干货
classifier.classify('双重过滤'); // => 干货 (trained as 小五金)
classifier.classify('波斯');     // => 干货 (trained as 小五金)
```

Why does it return 干货 every time?

ahl5esoft avatar Jul 24 '14 08:07 ahl5esoft

The classifier relies on a tokenizer and a stemmer, so that could be part of the problem. I don't think we have a Chinese stemmer at the moment, and if you use the English one it will use the English tokenizer, which probably won't help much.

This is part of the reason we need #159: it could help ensure that when a tokenizer is used, it's the correct one for the language.
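For illustration, here is a minimal sketch of why the English pipeline fails on Chinese input. It assumes the classifier falls back to natural's PorterStemmer.tokenizeAndStem, whose tokenizer splits on non-word characters and therefore discards CJK text entirely:

```js
var natural = require('natural');

// The English pipeline tokenizes on non-word (\W) boundaries before
// stemming. In a plain JS regex, Chinese characters count as \W, so
// they are treated as separators and the string tokenizes to nothing.
console.log(natural.PorterStemmer.tokenizeAndStem('五裕紫菜片'));
// => [] under this assumption; every Chinese document then yields the
// same empty feature set, leaving the classifier nothing to learn from.
```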

kkoch986 avatar Jul 26 '14 13:07 kkoch986

I think Chinese doesn't need stemming at all, but tokenizing a Chinese document is a very painful job. :(
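To illustrate the pain, take the classic ambiguity in the phrase used a couple of comments below. A sketch with the nodejieba segmenter (output is illustrative):

```js
// '南京市长江大桥' should split as 南京市 / 长江大桥
// ("Nanjing City / Yangtze River Bridge"), but a naive segmenter can
// read it as 南京 / 市长 / 江大桥 ("Nanjing / Mayor / Jiang Daqiao").
var nodejieba = require('nodejieba');
console.log(nodejieba.cut('南京市长江大桥'));
// e.g. [ '南京市', '长江大桥' ]
```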

mike820324 avatar Feb 28 '15 13:02 mike820324

Not sure if this is the intended way, but I tried applying nodejieba to classification and it seems to work.

```js
var nodejieba = require("nodejieba");
var natural = require('natural'),
    classifier = new natural.BayesClassifier();

classifier.addDocument(nodejieba.cut("红掌拨清波"), 'poem');
classifier.addDocument(nodejieba.cut("想睇戲"), 'action');
classifier.addDocument(nodejieba.cut("南京市长江大桥"), 'place');
classifier.train();

console.log(classifier.classify(nodejieba.cut('红掌拨清波')));
console.log(classifier.classify(nodejieba.cut("想睇戲")));
console.log(classifier.classify(nodejieba.cut('南京市长江大桥睇戲')));
```
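(This appears to work because addDocument and classify also accept a pre-tokenized array of words rather than a raw string, and nodejieba.cut returns exactly that, so natural's built-in English tokenizer is bypassed entirely.)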

smilechun avatar Sep 19 '16 09:09 smilechun

So basically, would it be possible to add a TokenizerZh by using nodejieba.cut as the tokenization-function override?
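Something along those lines may already be possible without a dedicated TokenizerZh class. A sketch, assuming BayesClassifier accepts a custom stemmer-like object exposing tokenizeAndStem and stem (as the constructor's stemmer parameter suggests):

```js
var nodejieba = require('nodejieba');
var natural = require('natural');

// A stand-in "stemmer" whose only job is segmentation: Chinese has no
// inflection to strip, so stem() is the identity function.
var chineseStemmer = {
  tokenizeAndStem: function (text) {
    return nodejieba.cut(text);
  },
  stem: function (token) {
    return token;
  }
};

var classifier = new natural.BayesClassifier(chineseStemmer);
classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.train();
console.log(classifier.classify('波斯BS-I3091测电笔'));
```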

loretoparisi avatar Feb 07 '18 16:02 loretoparisi

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?
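For reference, a minimal sketch of how that package tokenizes, assuming its loadFile API, which takes the path to a CC-CEDICT dictionary file (the path below is a placeholder):

```js
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8');

// Each token is an object; .text holds the surface string.
const words = tokenize('南京市长江大桥').map(token => token.text);
console.log(words); // e.g. [ '南京市', '长江大桥' ]
```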

titanism avatar Jun 12 '22 06:06 titanism

Will look into this.

Hugo-ter-Doest avatar Jun 13 '22 11:06 Hugo-ter-Doest