string_grouper: Adding similarity column in the group_similar_strings output

Hi,

Thank you for this amazing code; it has worked great so far in my use case. How can I add the computed cosine similarity values to the output of the group_similar_strings function? The output I am trying to produce is a pandas.Series containing each duplicated name with its cosine similarity to the deduplicated_name.

So it would be something like this: Line Number | Company Name | Company CIK Key | Similarity | deduplicated_name

Any help would be appreciated. Thank you.

Nov 04 '20 05:11 selfcontrol7WC

Hi @selfcontrol7WC

The deduplicated name can be seen as a kind of "group identifier", where all strings that have the same deduplicated_name belong to the same group. All strings in that group are similar to each other, and the group identifier is just a "random" string within the group of similar strings. So it's not necessarily clear which similarity to pick. For example, suppose you have 3 similar strings with the following similarities:

string_a - string_b - 0.80
string_a - string_c - 0.99
string_b - string_c - 0.75

The deduplicated name will be "string_a", but for the entry "string_c", for example, do you pick 0.99? That means the low similarity of 0.75 will be lost. It is also possible to have another string (string_d) with similarity 0.99 to string_c, but 0.74 to string_a. If your cutoff value is 0.75, string_a and string_d will not count as a match, but string_d will still be in the same group.
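To make this concrete, here is a minimal sketch in plain Python. It is not string_grouper's actual code: it simply assumes that groups are formed by following matches transitively, as described above. The similarity values are the ones from the example, and the function name group_transitively is made up for illustration.

```python
from collections import defaultdict

# Hypothetical pairwise similarities, taken from the example above.
pairs = [
    ("string_a", "string_b", 0.80),
    ("string_a", "string_c", 0.99),
    ("string_b", "string_c", 0.75),
    ("string_c", "string_d", 0.99),
    ("string_a", "string_d", 0.74),  # below the cutoff, so not a match
]
CUTOFF = 0.75  # assumption: a pair counts as a match when sim >= CUTOFF


def group_transitively(pairs, cutoff):
    """Map every string to a group label by following matches transitively."""
    neighbours = defaultdict(set)
    for left, right, sim in pairs:
        # Touch both keys so every string gets an entry, even without matches.
        neighbours[left]
        neighbours[right]
        if sim >= cutoff:
            neighbours[left].add(right)
            neighbours[right].add(left)

    group_of = {}
    for start in sorted(neighbours):
        if start in group_of:
            continue
        # Depth-first walk over everything reachable from `start`.
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group_of:
                continue
            group_of[node] = start  # label the group by its first member
            stack.extend(n for n in neighbours[node] if n not in group_of)
    return group_of


groups = group_transitively(pairs, CUTOFF)
for name in sorted(groups):
    print(name, "->", groups[name])
```

All four strings end up labelled string_a, even though string_a and string_d never matched directly: string_d is reached only through string_c.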

Another possibility is to show, for each entry, the lowest similarity it has with any string in the group. I think this might give a better indication of how similar a string is to the rest of its group. I think this is possible to do with some hacking.
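A rough sketch of what that could look like, again in plain Python rather than as a patch to string_grouper. It assumes you already have the list of matched pairs with their similarities and a group identifier for every string. The data is the hypothetical example from above, and the function name summarise is made up.

```python
from collections import defaultdict

# Hypothetical matches (pairs at or above the cutoff) from the example above.
matches = [
    ("string_a", "string_b", 0.80),
    ("string_a", "string_c", 0.99),
    ("string_b", "string_c", 0.75),
    ("string_c", "string_d", 0.99),
]
# Assumed grouping result: every string maps to its group identifier.
group_of = {
    "string_a": "string_a",
    "string_b": "string_a",
    "string_c": "string_a",
    "string_d": "string_a",
}


def summarise(matches, group_of):
    """Return (name, group identifier, similarity to identifier, lowest similarity)."""
    sims = defaultdict(dict)
    for left, right, sim in matches:
        if group_of[left] == group_of[right]:
            sims[left][right] = sim
            sims[right][left] = sim

    rows = []
    for name in sorted(group_of):
        identifier = group_of[name]
        # None means there was no direct match with the group identifier.
        to_identifier = 1.0 if name == identifier else sims[name].get(identifier)
        # Lowest similarity among this string's matches inside its group.
        lowest = min(sims[name].values(), default=None)
        rows.append((name, identifier, to_identifier, lowest))
    return rows


for row in summarise(matches, group_of):
    print(row)
```

Note that string_d gets no similarity to the group identifier at all, since it only matched string_c. That is exactly the ambiguity described above, and it is why the lowest in-group similarity may be the more informative column.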

Nov 04 '20 19:11 Bergvca

Hi,

Thank you for your prompt reply and your detailed explanation. I appreciate it.

Yes, I clearly understand the tricky part about which group identifier to pick, and also the fact that the low similarities will be lost. I had not thought about this before.

In your example, if string_a is selected as the group identifier, I would therefore pick 0.99 as the similarity for string_c, but would then lose the other similarities related to string_c.

1. Thinking about it again, for simplicity, in my use case having the similarity value of each string within a group together with its group identifier would be great for now.

2. Also, I like your approach of tracking and showing the lowest similarity of each string within the group. In that case, I cannot see how it would be part of the same single data frame returned by the group_similar_strings function, as in point 1 above. Would it be a separate, second data frame? Also, for our example, would this data frame be something like a 3-dimensional data frame with 10 rows?

Sorry, I really have no idea where to start, which is why I drew these tables to make it clear in my own mind as well.

Could you please guide me on how I can hack the code to get these results?

Thank you again for your time.

Nov 05 '20 03:11 selfcontrol7WC
