Grouping issue when TFIDF.min_similarity < link_min_similarity

Open · colasri opened this issue 3 years ago • 1 comment

In the code below (with output in attached picture) I perform a simple TFIDF matching of ["apple", "apples", "appl", "recal", "happy"].

The initial min_similarity is set to 0.2. The similarity of happy and appl is 0.24.

When grouping with a link_min_similarity of 0.5, happy should not belong to the apples group. Yet in the output of .get_matches(), it is in the apples group.

It does not appear to be in the cluster, though.

[Image: grouping]

Plain text code:

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

from_list = ["apple", "apples", "appl", "recal", "happy"]
matcher = TFIDF(min_similarity=0.2)
model = PolyFuzz(matcher).match(from_list)
cm = model.cluster_mappings
model.group(link_min_similarity=0.5, group_all_strings=True)
print(model.get_matches())

colasri avatar Jul 22 '22 10:07 colasri

I am not entirely sure, but there seems to be an issue with the group_all_strings parameter when combined with link_min_similarity. Most likely, (appl, apple) gets into the apples cluster, and (happy, appl) then gets into the same cluster because it shares appl. I'll have to dig a little deeper to figure this out, but I'll make sure a fix gets released in the next version!

MaartenGr avatar Jul 23 '22 08:07 MaartenGr
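
For anyone following along, here is a rough, library-free sketch of the kind of effect hypothesized above: strings chained into one group through a shared neighbour when weak matches are not filtered by the stricter threshold. This is not PolyFuzz's actual grouping code. The link_groups helper is hypothetical, and every similarity value except the 0.24 for (happy, appl) reported in the issue is made up for illustration.

```python
# Hypothetical pairwise similarities. Only ("happy", "appl") = 0.24 comes from
# the issue; every other value is invented for illustration.
similarities = {
    ("apple", "apples"): 0.80,
    ("appl", "apple"): 0.75,
    ("appl", "apples"): 0.60,
    ("happy", "appl"): 0.24,
}


def link_groups(strings, sims, threshold):
    """Group strings by joining every pair whose similarity is >= threshold.

    Illustrative single-linkage grouping via union-find; an assumption made
    for this sketch, not the algorithm PolyFuzz uses.
    """
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent pointers to the group's root, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), score in sims.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in strings:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(group) for group in groups.values())


strings = ["apple", "apples", "appl", "recal", "happy"]

# Linking at the matcher's threshold (0.2) pulls happy in through appl.
loose = link_groups(strings, similarities, threshold=0.2)

# Linking at the stricter threshold (0.5) keeps happy on its own.
strict = link_groups(strings, similarities, threshold=0.5)

print(loose)
print(strict)
```

If the groups shown by .get_matches() resemble the first result while the clusters resemble the second, that would be consistent with the weak (happy, appl) match surviving somewhere in the grouping step, though only the library code can confirm this.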