
About Chinese

ahl5esoft opened this issue 11 years ago • 7 comments

How do I use the classifier with Chinese text?

ahl5esoft avatar Jul 24 '14 08:07 ahl5esoft

```js
classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('优香岛桂皮', '干货');
classifier.addDocument('苗家辣妹辣椒', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.addDocument('三达SD-156A双重过滤烟嘴', '小五金');
classifier.addDocument('波斯BS-I3091测电笔', '小五金');
classifier.train();
```

```js
classifier.classify('紫菜');     // => 干货
classifier.classify('双重过滤'); // => 干货 (trained as 小五金)
classifier.classify('波斯');     // => 干货 (trained as 小五金)
```

Why does it return 干货 every time?

ahl5esoft avatar Jul 24 '14 08:07 ahl5esoft

The classifier relies on a tokenizer and a stemmer, so that could be part of the problem. I don't think we have a Chinese stemmer at the moment, and if you use the English one it will use the English tokenizer, which probably won't help much.

This is part of the reason we need #159: it could help ensure that when a tokenizer is used, it's the correct one for the language.
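For illustration, here is a minimal sketch of why the English pipeline fails on Chinese input. It assumes the classifier falls back to natural's PorterStemmer.tokenizeAndStem, whose tokenizer splits on non-word characters and therefore discards CJK text entirely:

```js
var natural = require('natural');

// The English pipeline tokenizes on non-word (\W) boundaries before
// stemming. In a plain JS regex, Chinese characters count as \W, so
// they are treated as separators and the string tokenizes to nothing.
console.log(natural.PorterStemmer.tokenizeAndStem('五裕紫菜片'));
// => [] under this assumption; every Chinese document then yields the
// same empty feature set, leaving the classifier nothing to learn from.
```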

kkoch986 avatar Jul 26 '14 13:07 kkoch986

I think Chinese doesn't need stemming at all, but tokenizing a Chinese document is a very painful job. :(
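To illustrate the pain, take the classic ambiguity in the phrase used a couple of comments below. A sketch with the nodejieba segmenter (output is illustrative):

```js
// '南京市长江大桥' should split as 南京市 / 长江大桥
// ("Nanjing City / Yangtze River Bridge"), but a naive segmenter can
// read it as 南京 / 市长 / 江大桥 ("Nanjing / Mayor / Jiang Daqiao").
var nodejieba = require('nodejieba');
console.log(nodejieba.cut('南京市长江大桥'));
// e.g. [ '南京市', '长江大桥' ]
```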

mike820324 avatar Feb 28 '15 13:02 mike820324

Not sure if this is the intended way, but I tried applying nodejieba to classification and it seems to work.

```js
var nodejieba = require("nodejieba");
var natural = require('natural'),
    classifier = new natural.BayesClassifier();

classifier.addDocument(nodejieba.cut("红掌拨清波"), 'poem');
classifier.addDocument(nodejieba.cut("想睇戲"), 'action');
classifier.addDocument(nodejieba.cut("南京市长江大桥"), 'place');
classifier.train();

console.log(classifier.classify(nodejieba.cut('红掌拨清波')));
console.log(classifier.classify(nodejieba.cut("想睇戲")));
console.log(classifier.classify(nodejieba.cut('南京市长江大桥睇戲')));
```
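(This appears to work because addDocument and classify also accept a pre-tokenized array of words rather than a raw string, and nodejieba.cut returns exactly that, so natural's built-in English tokenizer is bypassed entirely.)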

smilechun avatar Sep 19 '16 09:09 smilechun

So basically, would it be possible to add a TokenizerZh by using nodejieba.cut as the tokenization-function override?
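Something along those lines may already be possible without a dedicated TokenizerZh class. A sketch, assuming BayesClassifier accepts a custom stemmer-like object exposing tokenizeAndStem and stem (as the constructor's stemmer parameter suggests):

```js
var nodejieba = require('nodejieba');
var natural = require('natural');

// A stand-in "stemmer" whose only job is segmentation: Chinese has no
// inflection to strip, so stem() is the identity function.
var chineseStemmer = {
  tokenizeAndStem: function (text) {
    return nodejieba.cut(text);
  },
  stem: function (token) {
    return token;
  }
};

var classifier = new natural.BayesClassifier(chineseStemmer);
classifier.addDocument('五裕紫菜片', '干货');
classifier.addDocument('海博卷尺', '小五金');
classifier.train();
console.log(classifier.classify('波斯BS-I3091测电笔'));
```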

loretoparisi avatar Feb 07 '18 16:02 loretoparisi

You can use https://github.com/yishn/chinese-tokenizer for tokenization. Perhaps @Hugo-ter-Doest would like to add this directly to the package, similar to the port done for the Japanese tokenizer?
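For reference, a minimal sketch of how that package tokenizes, assuming its loadFile API, which takes the path to a CC-CEDICT dictionary file (the path below is a placeholder):

```js
const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8');

// Each token is an object; .text holds the surface string.
const words = tokenize('南京市长江大桥').map(token => token.text);
console.log(words); // e.g. [ '南京市', '长江大桥' ]
```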

titanism avatar Jun 12 '22 06:06 titanism

Will look into this.

Hugo-ter-Doest avatar Jun 13 '22 11:06 Hugo-ter-Doest