
SimhashIndex.get_near_dups does not return results in order of similarity

Open bobkentt opened this issue 6 years ago • 4 comments

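I customized it as shown below:

```python
# Replacement body for SimhashIndex.get_near_dups(self, simhash).
# Needs PriorityQueue (`from queue import PriorityQueue` on Python 3).
ans = PriorityQueue()

for key in self.get_keys(simhash):
    dups = self.bucket[key]
    self.log.debug('key:%s', key)
    if len(dups) > 200:
        self.log.warning('Big bucket found. key:%s, len:%s', key, len(dups))

    for dup in dups:
        sim2, obj_id = dup.split(',', 1)
        sim2 = Simhash(long(sim2, 16), self.f)

        d = simhash.distance(sim2)
        if d <= self.k:
            # Queue entries are (distance, obj_id), so the closest match comes out first.
            ans.put((d, obj_id))

res = []
tmp = {}  # obj_ids already emitted; the same id can be found via several keys
while not ans.empty():
    d, obj_id = ans.get()
    if obj_id not in tmp:
        res.append(str(obj_id))
        tmp[obj_id] = 1
return res
```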

bobkentt avatar Oct 10 '19 10:10 bobkentt

Hi, thanks for reaching out. Actually I don't understand your question. Could you describe a bit more? If you can add your expected result and the actual output, that would help.

1e0ng avatar Oct 10 '19 14:10 1e0ng

In your original code, the results of SimhashIndex.get_near_dups are not ordered by similarity. If there are several results, which one is the most similar?

Chuanyunux avatar Apr 08 '20 02:04 Chuanyunux

> In your original code, the results of SimhashIndex.get_near_dups are not ordered by similarity. If there are several results, which one is the most similar?

Yes, that's what I mean.

I added a PriorityQueue() so that the results come back ordered by similarity.
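
To make the idea easier to try outside the library, here is a minimal standalone sketch of the same ordering logic. It does not use the simhash package at all: `hamming_distance` and `near_dups_ordered` are hypothetical helper names, fingerprints are plain integers, and the candidate list stands in for whatever the index buckets would return.

```python
from queue import PriorityQueue


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer fingerprints."""
    return bin(a ^ b).count("1")


def near_dups_ordered(query: int, candidates, k: int):
    """Return ids of candidates within distance k of query, closest first.

    `candidates` is an iterable of (fingerprint, obj_id) pairs. The same
    obj_id may appear more than once, as it can when it is found via
    several bucket keys.
    """
    queue = PriorityQueue()
    for fingerprint, obj_id in candidates:
        d = hamming_distance(query, fingerprint)
        if d <= k:
            # Tuples compare by distance first, so the nearest id is popped first.
            queue.put((d, obj_id))

    ordered, seen = [], set()
    while not queue.empty():
        _, obj_id = queue.get()
        if obj_id not in seen:  # keep only the first (closest) occurrence
            seen.add(obj_id)
            ordered.append(obj_id)
    return ordered


candidates = [
    (0b11110000, "a"),  # distance 0 from the query
    (0b11110011, "b"),  # distance 2
    (0b11110001, "c"),  # distance 1
    (0b11110011, "b"),  # same id seen again through another bucket
    (0b00001111, "d"),  # distance 8, outside k
]
result = near_dups_ordered(0b11110000, candidates, k=3)
print(result)  # ['a', 'c', 'b']
```

Collecting the `(d, obj_id)` pairs in a list and calling `sorted()` on it would give the same order; the queue is kept here only to mirror the change described above.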

bobkentt avatar Jun 30 '20 05:06 bobkentt

@bobkentt That sounds like a good optimization. Would you like to create a pull request for that?

1e0ng avatar Jul 01 '20 16:07 1e0ng