SimhashIndex.get_near_dups does not preserve the similarity order of results
I customized it as below:
```
ans = PriorityQueue()
for key in self.get_keys(simhash):
    dups = self.bucket[key]
    self.log.debug('key:%s', key)
    if len(dups) > 200:
        self.log.warning('Big bucket found. key:%s, len:%s', key, len(dups))
    for dup in dups:
        sim2, obj_id = dup.split(',', 1)
        sim2 = Simhash(long(sim2, 16), self.f)
        d = simhash.distance(sim2)
        if d <= self.k:
            ans.put((d, obj_id))
res = []
tmp = {}
while not ans.empty():
    d, obj_id = ans.get()
    if obj_id not in tmp:
        res.append(str(obj_id))
        tmp[obj_id] = 1
```
Hi, thanks for reaching out. Actually I don't understand your question. Could you describe a bit more? If you can add your expected result and the actual output, that would help.
In your original code, the results of SimhashIndex.get_near_dups are not ordered by similarity. If there are several results, which one is the most similar?
Yes, that's right.
Adding a PriorityQueue() returns the results in order of similarity.
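For anyone following along, here is a minimal standalone sketch of the same idea that does not depend on the library. It assumes fingerprints are plain integers and that candidates arrive as (fingerprint, obj_id) pairs; `hamming_distance` and `rank_near_dups` are hypothetical names used only for illustration, not part of simhash. It uses `heapq` instead of `PriorityQueue`, which should behave the same here since no threads are involved.

```python
import heapq


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer fingerprints."""
    return bin(a ^ b).count("1")


def rank_near_dups(query: int, candidates, k: int):
    """Return obj_ids within distance k of query, closest first, no repeats.

    candidates: iterable of (fingerprint, obj_id) pairs (assumed layout).
    """
    heap = []
    for fingerprint, obj_id in candidates:
        d = hamming_distance(query, fingerprint)
        if d <= k:
            # Tuples sort by distance first, then by obj_id on ties.
            heapq.heappush(heap, (d, obj_id))

    seen = set()
    ranked = []
    while heap:
        _, obj_id = heapq.heappop(heap)
        # The same obj_id can show up in several buckets; keep the first hit.
        if obj_id not in seen:
            seen.add(obj_id)
            ranked.append(obj_id)
    return ranked
```

With this, the first element of the returned list is the closest match, and ties are broken by obj_id.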
@bobkentt That sounds like a good optimization. Would you like to create a pull request for that?