hjy2588818 comments

Results: 16 comments by hjy2588818

Does this class support elements in Chinese?

Once you replace the default English tokenizer with a Chinese word segmenter, the accuracy is basically fine.

@shtse8 I'm using GaussianComparator(30); 30 is the value I passed in here.

@shtse8

    $simhash = new \Tga\SimHash\SimHash();
    $extractor = new \Tga\SimHash\Extractor\SimpleTextExtractor();
    // word segmentation
    $comparator = new \Tga\SimHash\Comparator\GaussianComparator(30);
    $fp1 = $simhash->hash($this->get_scws($text1), \Tga\SimHash\SimHash::SIMHASH_64);
    // die;
    $fp2 = $simhash->hash($this->get_scws($text2), \Tga\SimHash\SimHash::SIMHASH_64);
    // $fp1 = "1001010101010101000011000111010010001010010111110001000000000000";
    //...

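http://www.cnblogs.com/maybe2030/p/5203186.html That article splits the fingerprint into 16-character segments, but I don't know how the segments are stored in MySQL to speed up the comparison.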
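A minimal PHP sketch of the 16-character split, in case it helps. The helper names `fingerprintBlocks` and `matchingBlocks` are made up for illustration and are not part of any library. The idea rests on the pigeonhole principle: if two 64-bit fingerprints differ in at most 3 bits, at least one of their four 16-bit blocks must be identical. Each block could therefore go into its own indexed column, so that an exact-match lookup on the blocks narrows down the candidates before any full comparison. How the linked article actually lays this out in MySQL is not something this sketch claims to reproduce.

```php
<?php
// Hypothetical helper, not part of any library: cut a 64-character
// binary fingerprint into four 16-character blocks.
function fingerprintBlocks(string $fp): array
{
    if (strlen($fp) !== 64) {
        throw new InvalidArgumentException('expected 64 binary characters');
    }
    return str_split($fp, 16);
}

// Positions (0..3) at which two fingerprints share an identical block.
function matchingBlocks(string $a, string $b): array
{
    $blocksA = fingerprintBlocks($a);
    $blocksB = fingerprintBlocks($b);
    $same = [];
    foreach ($blocksA as $i => $block) {
        if ($block === $blocksB[$i]) {
            $same[] = $i;
        }
    }
    return $same;
}

// Example: flip three bits, one in each of the first three blocks.
$a = str_repeat('10', 32);
$b = $a;
foreach ([0, 20, 40] as $pos) {
    $b[$pos] = $b[$pos] === '1' ? '0' : '1';
}
print_r(matchingBlocks($a, $b)); // only the last block (index 3) still matches
```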

@shtse8 For the two fingerprints you posted above, I get 0.97314496305805.

It seems the value passed to GaussianComparator(30) can't be chosen arbitrarily...

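Awkward. The default value is 3. I compared against another online simhash calculator, and the results were fairly close when I used 30. But 30 is probably wrong; that's not how it is meant to be used. Now that I'm testing with real data, there are serious problems: two articles with different content that merely share some keywords still come out only slightly below 0.98. I urgently need to solve this. My site has a lot of similar content that needs to be cleaned out, otherwise it will turn into a spam site.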
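A rough sketch of why a large value flattens the scores. It assumes the comparator maps the number of differing bits d to exp(-d² / (2σ²)), where σ is the constructor argument. That formula is an assumption, not something confirmed against the library source, but it does reproduce the 0.973... figure quoted earlier in this thread when d = 7 and σ = 30. The function name `gaussianSimilarity` is made up for illustration.

```php
<?php
// Assumed scoring rule, not confirmed against the library source:
// similarity = exp(-d^2 / (2 * sigma^2)), d = number of differing bits.
function gaussianSimilarity(int $distance, float $sigma): float
{
    return exp(-($distance ** 2) / (2 * $sigma ** 2));
}

// Compare how quickly the score falls off for sigma = 3 and sigma = 30.
foreach ([3, 30] as $sigma) {
    foreach ([0, 3, 7, 16] as $d) {
        printf("sigma=%d d=%d score=%.4f\n", $sigma, $d, gaussianSimilarity($d, $sigma));
    }
}
```

Under this assumption, with σ = 30 even 16 differing bits out of 64 still score above 0.85, so unrelated articles that share a few keywords look nearly identical. With σ = 3 the score is already below 0.1 at 7 differing bits.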

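@shtse8 I just went with the default of 3. I didn't delete any data directly; pairs scoring above 0.25 are stored in a table. I verified them with a third-party online checker ( http://life.chacuo.net/convertsimilar ) and found that pairs scoring above 0.45 on my side are basically more than 90% similar. My word segmentation is different from yours and I can't be bothered with the details. Either way it is much more efficient than before, so I simply removed everything above 0.8.

Your segmenter is not the same as mine, so you still need to validate against real data and see where the threshold for treating documents as duplicates lies.

I compute the fingerprint of each newly added document in real time, take the first 16 bits, find all records with the same first 16 bits, and then run the full 64-bit comparison on those.

123456.... 12 13 14 15 16 23 24 25 26 34 35 36 45 46

This way only a small amount of data has to be compared.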
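The prefix lookup described above could be sketched roughly like this. All names are hypothetical, and the in-memory array is only a stand-in for a table with an index on the first 16 bits. Note the limitation: matching on the prefix alone misses near-duplicates whose differing bits fall inside the first 16 bits.

```php
<?php
// Hypothetical helper: count positions where two equal-length bit strings differ.
function hammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException('fingerprints must have equal length');
    }
    $diff = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $diff++;
        }
    }
    return $diff;
}

// Hypothetical in-memory stand-in for a table indexed on the first 16 bits.
function buildPrefixIndex(array $fingerprints): array
{
    $index = [];
    foreach ($fingerprints as $id => $fp) {
        $index[substr($fp, 0, 16)][$id] = $fp;
    }
    return $index;
}

// Return id => distance for stored fingerprints that share the 16-bit prefix
// and lie within $maxDistance bits of the new fingerprint.
function nearDuplicates(array $index, string $fp, int $maxDistance): array
{
    $hits = [];
    foreach ($index[substr($fp, 0, 16)] ?? [] as $id => $stored) {
        $d = hammingDistance($fp, $stored);
        if ($d <= $maxDistance) {
            $hits[$id] = $d;
        }
    }
    return $hits;
}

// Example: doc1 differs from the new fingerprint only in the last bit,
// doc2 has a different prefix and is never compared in full.
$new = str_repeat('10', 32);
$near = $new;
$near[63] = '1';
$far = str_repeat('01', 32);
$index = buildPrefixIndex(['doc1' => $near, 'doc2' => $far]);
print_r(nearDuplicates($index, $new, 3));
```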

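@shtse8 It still doesn't feel very accurate: for two articles with different content, the value computed from the hash fingerprints actually comes out as 1.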

@shtse8 Have you uploaded it to GitHub? Please share.