Chinese text extraction is not correct
As the title suggests, take the code below:
keyword_processor := NewKeywordProcessor()
keyword_processor.AddKeywords("欢迎")
keyword_processor.AddKeywords("来")
keyword_processor.AddKeywords("北京")
result := keyword_processor.ExtractKeywords("欢迎来北京")
for _, v := range result {
	e := ExtractResult(v)
	fmt.Printf("return : %s\n", e.Keyword)
}
There is nothing in the output, because len(result) = 0.
If we change the keywords above to English:
keyword_processor := NewKeywordProcessor()
keyword_processor.AddKeywords("welcome")
keyword_processor.AddKeywords("to")
keyword_processor.AddKeywords("beijing")
result := keyword_processor.ExtractKeywords("welcome to beijing")
for _, v := range result {
	e := ExtractResult(v)
	fmt.Printf("return : %s\n", e.Keyword)
}
The result is:
return : welcome
return : to
return : beijing
Sorry, it doesn't support Chinese for now. English uses spaces to separate words, but Chinese does not. I did consider adding this feature to this repo, but in the end I thought it would be better to build a new tool for extracting keywords from Chinese sentences.
Hi hiafenghuang, I recently did similar work on flashtext with Chinese support.
keywordProcessor := gf.NewKeywordProcessor()
keywordProcessor.AddKeyword("欢迎")
keywordProcessor.AddKeyword("来")
keywordProcessor.AddKeyword("北京")
result := keywordProcessor.ExtractKeywords("欢迎来北京")
for _, v := range result {
	fmt.Printf("return : %s\n", v)
}
And the result is:
return : 欢迎
return : 来
return : 北京
The package is here.
Besides, I previously used PyFlashtext, which has similar problems with Chinese, and I fixed them. To improve performance in my production environment, I rewrote the FlashText algorithm in Go instead of Python, and it works well. You are welcome to use go-flashtext.