string_grouper

Some general questions about the package

Open eneszv opened this issue 3 years ago • 0 comments

Dear developer,

I've recently found this interesting package and I have a few questions. Not sure if this is the right place to post them.

  1. I'm working with a data set of company names that is updated over time. The goal is to group the names into the same entities, much like the example you presented in https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md. Your matching algorithm is based on character-level N-grams and TF-IDF vectors. Because of that, I guess the algorithm is not deterministic across updates, and some existing companies might no longer be matched together in the updated data. Do you have any experience working with dynamic data sets, and any advice on whether it is worth trying this package?
  2. In the documentation https://bergvca.github.io/2017/10/14/super-fast-string-matching.html, you mention that for Levenshtein distance the number of calculations grows quadratically. Actually, the complexity can be reduced to O(n log n) using an appropriate data structure, such as a BK-tree. What is the computational complexity of your algorithm? Besides speed, is there any other reason why you would recommend N-grams and TF-IDF over Levenshtein-based metrics?
  3. Do the functions match_strings and group_similar_strings share the same logic and differ only in the format of their output? Is it possible for companies A and B to be grouped together by group_similar_strings but not matched by match_strings, given the same similarity threshold and the same data set?
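
To make question 1 concrete, here is a minimal pure-Python sketch of character-level n-gram TF-IDF matching. It is not string_grouper's actual implementation: the function names, the trigram size, and the smoothed IDF weighting are all assumptions chosen for illustration. It shows that the score for a fixed pair of names is reproducible on a fixed data set, but shifts once a new record is added, because the IDF weights depend on the whole corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build L2-normalised TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed IDF; this particular weighting is an assumption of the sketch.
        vec = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


names = ["Acme Corporation", "ACME Corp.", "Globex Inc"]  # made-up sample data
vecs = tfidf_vectors(names)
sim_before = cosine(vecs[0], vecs[1])

# Same pair, but the data set has grown by one record.
vecs_updated = tfidf_vectors(names + ["Acme Holdings"])
sim_after = cosine(vecs_updated[0], vecs_updated[1])

print(f"before update: {sim_before:.3f}, after update: {sim_after:.3f}")
```

With a fixed similarity threshold, a pair whose score sits close to the threshold could therefore fall on different sides of it before and after an update.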

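For reference, the BK-tree idea mentioned in question 2 can be sketched as follows. This is a minimal illustration with a hypothetical `BKTree` class and made-up sample names, not code from any library. The tree indexes strings by their edit distance to a parent node, and the triangle inequality lets a query skip subtrees that cannot contain a match. How much work is actually saved depends on the tolerance and on the data.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree; each node is (word, {distance: child})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of `word`."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_dist:
                results.append((d, candidate))
            # Triangle inequality: only children whose edge label lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for name in ["acme corp", "acme corp.", "acme co", "globex inc", "initech"]:
    tree.add(name)

matches = tree.search("acme corp", 2)
print(matches)
```
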
Thanks in advance!

eneszv, Aug 04 '22 11:08