Chinese text extraction is not correct

Open haifenghuang opened this issue 8 years ago • 2 comments

As the title suggests, consider the code below:

keyword_processor := NewKeywordProcessor()
keyword_processor.AddKeywords("欢迎")
keyword_processor.AddKeywords("来")
keyword_processor.AddKeywords("北京")

result := keyword_processor.ExtractKeywords("欢迎来北京")

for _, v := range result {
    e := ExtractResult(v)
    fmt.Printf("return : %s\n", e.Keyword)
}

There is nothing in the output, because len(result) = 0.

If we change the keywords above to English:

keyword_processor := NewKeywordProcessor()
keyword_processor.AddKeywords("welcome")
keyword_processor.AddKeywords("to")
keyword_processor.AddKeywords("beijing")

result := keyword_processor.ExtractKeywords("welcome to beijing")

for _, v := range result {
    e := ExtractResult(v)
    fmt.Printf("return : %s\n", e.Keyword)
}

The result is:

return : welcome
return : to
return : beijing

haifenghuang, Jan 17 '18 15:01

Sorry, it doesn't support Chinese for now. English uses spaces to separate words, but Chinese does not. I did consider adding this feature to this repo, but in the end I decided it would be better to build a new tool for extracting from Chinese sentences.
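
To illustrate why the lack of spaces matters, here is a minimal sketch of a matcher that only accepts a keyword when it is not directly adjacent to another word character. It is not this repo's actual implementation; `isWordRune` and `extractWithBoundaries` are hypothetical names, and the assumption is that any Unicode letter (which includes Han characters) counts as a word character.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isWordRune reports whether r counts as part of a word.
// Assumption for this sketch: Han characters are Unicode letters, so they count too.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
}

// extractWithBoundaries (hypothetical helper) returns, in text order, every keyword
// occurrence that has no word rune immediately before or after it.
func extractWithBoundaries(text string, keywords []string) []string {
	var found []string
	for i := 0; i < len(text); {
		// Pick the longest keyword that starts at byte offset i.
		matched := ""
		for _, kw := range keywords {
			if kw != "" && strings.HasPrefix(text[i:], kw) && len(kw) > len(matched) {
				matched = kw
			}
		}
		if matched != "" {
			end := i + len(matched)
			prev, _ := utf8.DecodeLastRuneInString(text[:i])
			next, _ := utf8.DecodeRuneInString(text[end:])
			leftOK := i == 0 || !isWordRune(prev)
			rightOK := end == len(text) || !isWordRune(next)
			if leftOK && rightOK {
				found = append(found, matched)
				i = end
				continue
			}
		}
		// No accepted match here: advance by one whole rune.
		_, size := utf8.DecodeRuneInString(text[i:])
		i += size
	}
	return found
}

func main() {
	en := extractWithBoundaries("welcome to beijing", []string{"welcome", "to", "beijing"})
	zh := extractWithBoundaries("欢迎来北京", []string{"欢迎", "来", "北京"})
	fmt.Println(len(en), len(zh)) // prints "3 0"
}
```

In the Chinese sentence every keyword is flanked by another Han character, so the boundary check rejects all three candidates, which matches the empty result reported above.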

sundy-li, Jan 18 '18 01:01

Hi haifenghuang, I recently did similar work on flashtext with Chinese support.

	keywordProcessor := gf.NewKeywordProcessor()
	keywordProcessor.AddKeyword("欢迎")
	keywordProcessor.AddKeyword("来")
	keywordProcessor.AddKeyword("北京")

	result := keywordProcessor.ExtractKeywords("欢迎来北京")

	for _, v := range result {
		fmt.Printf("return : %s\n", v)
	}

And the result is:

return : 欢迎
return : 来
return : 北京

The package is here.

Besides, I used PyFlashtext, which has similar problems with Chinese, and I fixed them. To improve performance in my production environment, I rewrote the FlashText algorithm in Go instead of Python, and it works well. You are welcome to use go-flashtext.
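
For readers curious how matching without word boundaries can work, below is a minimal sketch of a rune-keyed trie with longest-match scanning. It is an illustration only, not go-flashtext's actual code; `node`, `insert`, and `extractNoBoundaries` are hypothetical names.

```go
package main

import "fmt"

// node is one trie node keyed by rune, so each multi-byte character is a single step.
type node struct {
	children map[rune]*node
	terminal bool // true if a keyword ends at this node
}

func newNode() *node { return &node{children: map[rune]*node{}} }

// insert adds keyword to the trie rooted at root.
func insert(root *node, keyword string) {
	cur := root
	for _, r := range keyword {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.terminal = true
}

// extractNoBoundaries returns the longest keyword found at each position,
// scanning left to right and performing no word-boundary check.
func extractNoBoundaries(root *node, text string) []string {
	runes := []rune(text)
	var found []string
	for i := 0; i < len(runes); {
		cur, end := root, -1
		for j := i; j < len(runes); j++ {
			next, ok := cur.children[runes[j]]
			if !ok {
				break
			}
			cur = next
			if cur.terminal {
				end = j + 1 // remember the longest match so far
			}
		}
		if end != -1 {
			found = append(found, string(runes[i:end]))
			i = end
		} else {
			i++
		}
	}
	return found
}

func main() {
	root := newNode()
	for _, kw := range []string{"欢迎", "来", "北京"} {
		insert(root, kw)
	}
	fmt.Println(extractNoBoundaries(root, "欢迎来北京")) // prints "[欢迎 来 北京]"
}
```

The trade-off is that dropping the boundary check also lets an English keyword such as "to" match inside a longer word like "tomorrow", so a real implementation would likely need to apply boundaries selectively.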

waltsmith88, Aug 28 '19 05:08