Adding similarity column in the group_similar_strings output
Hi,
Thank you for this amazing code; it has worked great so far in my use case. How can I add the similarity values from the computed cosine similarity to the output of the group_similar_strings function? The output I am trying to produce is a pandas.Series containing each duplicated name together with its cosine similarity to the deduplicated_name.
So it would be something like this: Line Number | Company Name | Company CIK Key | Similarity | deduplicated_name
Any help would be appreciated. Thank you.
Hi @selfcontrol7WC
The deduplicated name can be seen as a kind of "group identifier": all strings that have the same deduplicated_name belong to the same group. All strings in that group are similar to each other, and the group identifier is just a "random" string within the group of similar strings. So it's not necessarily clear which similarity to pick. For example, suppose you have 3 similar strings with the following similarities:
string_a - string_b - 0.80
string_a - string_c - 0.99
string_b - string_c - 0.75
The deduplicated name will be "string_a", but for the entry "string_c", for example, do you pick 0.99? That means the low similarity of 0.75 will be lost. It is also possible to have another string (string_d) with similarity 0.99 to string_c but 0.74 to string_a. If your cutoff value is 0.75, there will be no match between string_a and string_d, but string_d will still be in the same group.
Another possibility is to show, for each entry, the lowest similarity it has with any string in the group. I think this might give a better indication of how similar a string is to the rest of its group. I think this is possible to do with some hacking.
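As a rough sketch of that "lowest similarity within the group" idea, assuming the pairwise matches are already available as a DataFrame: the column names left_side, right_side and similarity, the groups Series and the helper lowest_similarity_in_group are all assumptions made for illustration, not something group_similar_strings returns. Pairs below the cutoff are never reported as matches, so the minimum is taken only over the pairs that were actually found.

```python
import pandas as pd

# Toy pairwise matches from the example above (hypothetical column names).
pairs = pd.DataFrame(
    {
        "left_side": ["string_a", "string_a", "string_b"],
        "right_side": ["string_b", "string_c", "string_c"],
        "similarity": [0.80, 0.99, 0.75],
    }
)

# Hypothetical grouping result: index = original string, value = group identifier.
groups = pd.Series(
    {"string_a": "string_a", "string_b": "string_a", "string_c": "string_a"},
    name="deduplicated_name",
)


def lowest_similarity_in_group(pairs, groups):
    """For each string, return the lowest similarity it has with another
    string of the same group, considering only the pairs that were matched."""
    # Mirror the pairs so that every string appears on the left side.
    mirrored = pairs.rename(
        columns={"left_side": "right_side", "right_side": "left_side"}
    )
    both = pd.concat([pairs, mirrored], ignore_index=True)
    # Ignore self-matches; they are always 1 and carry no information.
    both = both[both["left_side"] != both["right_side"]]
    # Keep only pairs whose two members share the same group identifier.
    same_group = both["left_side"].map(groups) == both["right_side"].map(groups)
    lowest = both[same_group].groupby("left_side")["similarity"].min()
    # Strings without any partner in their group get NaN.
    return pd.DataFrame(
        {
            "lowest_similarity": lowest.reindex(groups.index),
            "deduplicated_name": groups,
        }
    )


result = lowest_similarity_in_group(pairs, groups)
print(result)
```

With the three strings above, string_c would no longer hide its weak 0.75 link to string_b behind the 0.99 it shares with string_a.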
Hi,
Thank you for your prompt reply and your detailed explanation; I appreciate it.
Yes, I clearly understand the tricky part regarding which random group identifier to pick, and also the fact that the low similarities will be lost. I had not thought about this before.
In your example, if string_a is selected as the group identifier, I would therefore pick 0.99 as the similarity of string_c, but then lose the other similarities related to string_c.
1. Thinking again about it, for simplicity, in my use case having the similarity values of each string within a group and their group identifier would be great for now.
How I see it:
Company Name | Similarity | deduplicated_name
string_a | 1 | string_a
string_b | 0.80 | string_a
string_c | 0.99 | string_a
string_d | 0.74 | string_a
2. Also, I like your approach of tracking and showing the lowest similarity of each string within the group. In that case, I cannot see it as part of the same single data frame returned by the group_similar_strings function, as in point 1 above. Would it go in a separate, second data frame? Also, if we consider the same example, would this data frame have 3 columns and 10 rows (one row per pair of strings, including each string paired with itself)?
Something like this?
Company Name1 | Company Name2 | similarity
string_a | string_a | 1
string_a | string_b | 0.80
string_a | string_c | 0.99
string_a | string_d | 0.74
string_b | string_b | 1
string_b | string_c | 0.75
string_b | string_d | 0.77
string_c | string_c | 1
string_c | string_d | 0.99
string_d | string_d | 1
Sorry, I really have no idea where to start, which is why I drew these tables to make it clear in my own mind as well.
Could you please guide me on how I can hack the code to get these results?
Thank you again for your time.
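For reference, the two tables sketched above could be produced along these lines. This is only a sketch under assumptions: the pairs DataFrame with columns left_side, right_side and similarity stands in for whatever pairwise match output is available, and pairwise_table is a hypothetical helper, not part of the library. The toy values are the four similarities from the example in the reply; any pair that was not reported as a match (for instance because it fell below the cutoff, like string_a and string_d) is left as NaN rather than guessed.

```python
from itertools import combinations_with_replacement

import pandas as pd

# Toy pairwise matches (hypothetical column names, values from the example).
pairs = pd.DataFrame(
    {
        "left_side": ["string_a", "string_a", "string_b", "string_c"],
        "right_side": ["string_b", "string_c", "string_c", "string_d"],
        "similarity": [0.80, 0.99, 0.75, 0.99],
    }
)
members = ["string_a", "string_b", "string_c", "string_d"]  # one group
group_id = "string_a"  # the deduplicated_name of that group


def pairwise_table(pairs, members):
    """Long-format table with one row per unordered pair of group members,
    including each member paired with itself."""
    # Order-independent lookup: (a, b) and (b, a) share one entry.
    columns = ["left_side", "right_side", "similarity"]
    known = {
        frozenset((left, right)): sim
        for left, right, sim in pairs[columns].itertuples(index=False, name=None)
    }
    rows = []
    for first, second in combinations_with_replacement(members, 2):
        if first == second:
            sim = 1.0  # a string is identical to itself
        else:
            # NaN marks pairs that were never reported as a match.
            sim = known.get(frozenset((first, second)), float("nan"))
        rows.append((first, second, sim))
    return pd.DataFrame(
        rows, columns=["Company Name1", "Company Name2", "similarity"]
    )


table = pairwise_table(pairs, members)

# Point 1: keep only the rows that involve the group identifier.
involves_id = (table["Company Name1"] == group_id) | (
    table["Company Name2"] == group_id
)
to_group_id = table[involves_id]

print(table)
print(to_group_id)
```

For n strings in a group the long table has n * (n + 1) / 2 rows, which gives the 10 rows for the 4 strings here, and it is an ordinary 3-column data frame rather than a 3-dimensional one.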