skrub FEA Adding FuzzyJoin

This PR adds the new FuzzyJoin class that allows joining tables with dirty columns.

Co-authored-by: @LeoGrin

Jul 28 '22 12:07 jovan-stojanovic

Awesome! Some additional things which could be useful:

benchmark on datasets with typos, on abbreviations (@alexis-cvetkov was saying that we could use the is_abbrevation tag in YAGO for this), and maybe on many to many joins.
add a parameter to control for recall / precision tradeoff. It would probably be hard to set so I'm not sure. An idea: print the worst matches so the user knows if he needs more precision.
allow custom match dictionary
(maybe) add a option to require 100% for certain columns (when we have an id column but it's missing in some cases)

Jul 28 '22 18:07 LeoGrin

• allow custom match dictionary • (maybe) add a option to require 100% for certain columns (when we have an id column but it's missing in some cases)

Let's do these two later, in a second PR.

Thanks for the active back and forth, this is great!

Jul 28 '22 20:07 GaelVaroquaux

Thanks for this exciting PR. Two TODOs:

Fix the tests
Add an example. I have a hard time reviewing a PR without an example, because I cannot get a feeling on the user experience.

Aug 29 '22 09:08 GaelVaroquaux

Hi @jovan-stojanovic I'm curious why you went for Countvectorizer (e.g. with character n-grams) instead of, for example, something like the SimilarityEncoder.

I guess the "fuzzy" part comes from matching (for example) char n-grams which might still give close distance if there are some misspellings in fields/columns - but dirty_cat also covers this use case with the SimilarityEncoder, no?

(background is that we're thinking about how to do deduplication and how to use fuzzyjoin for that, but for this using string distances would be cool)

Thanks!

Sep 02 '22 09:09 mjboos

Hi @jovan-stojanovic I'm curious why you went for Countvectorizer (e.g. with character n-grams) instead of, for example, something like the SimilarityEncoder.

I guess the "fuzzy" part comes from matching (for example) char n-grams which might still give close distance if there are some misspellings in fields/columns - but dirty_cat also covers this use case with the SimilarityEncoder, no?

(background is that we're thinking about how to do deduplication and how to use fuzzyjoin for that, but for this using string distances would be cool)

Thanks!

Hi @mjboos , thanks for your comment. Both the SimilarityEncoder and FuzzyJoin are based on CountVectorizer, so they are similar in that way. However, we did not use the SimilarityEncoder directly because it can be very slow, because you need to compute for every category distances between them. From the user perspective, this is important. This is why we used Nearest Neighbors as it is much faster.

Here, there is an option return_distance=True that may help for deduplication.

However, you are right that the next challenge would be to add a better precision metric so as to distinguish between good and bad matches. It links to this discussion.

Maybe applying the SimilarityEncoder to n neairest neighbors would be an option then? Certainly better than applying it directly to all categories. This would be added as an option with precision='sim_enc' and a threshold.

In any case, thanks, you can create an issue, to be discussed as an additional feature in future PR's!

Sep 02 '22 10:09 jovan-stojanovic

Ok, thanks, working on fixing the tests and will add those changes in a few minutes.

Sep 02 '22 13:09 jovan-stojanovic

Code coverage is a bit low Can you analyze the report (button "details") and see what input configuration must be tested.

Sep 05 '22 19:09 GaelVaroquaux

Think I implemented most of your remarks! Please take a look when you have time, we can put it for release 0.3.

Sep 08 '22 08:09 jovan-stojanovic

Please take a look when you have time, we can put it for release 0.3.

I'm sorry, I don't want to rush this, or delay 0.3: we will not include this for 0.3.

Also, having it merged in the dev will enable to get feedback and modify things before releasing it to many users.

Sep 08 '22 08:09 GaelVaroquaux

Hi, Some feedbacks in terms of api:

The fact that when return_distance is True, the distances are returned in a separate variable bugs me a bit. In most of the use cases that I envision, I'll have to add that list into my joined table to link the matched record and their distance.
The values of the match_type parameter (nearest, radius) are not super clear to me. In both case we return the nearest neighbour so the wording might be a bit misleading. I also did not understand that match_threshold should only be used with radius mode.
The docstring description of radius method is not clear (when you talk about estimated precision and precision_threshold, this latter variable does not exist in the code btw).

Usage comments:

It was not clear to me that it was a 1-1 join search.
As you said, match_threshold seems to be quite hard to tune without trial and error. As the method is long to run, the process will be very long.
A too big value of match_threshold will lead to an error that is not very clear:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<timed exec> in <module>

/data.design/analytics/dsshome/config/projects/PLAYGROUNDDU/lib/python/fuzzy_join.py in fuzzy_join(left_table, right_table, on, return_distance, analyzer, ngram_range, match_type, match_threshold, suffixes, keep)
    234                 ]
    235             ),
--> 236             axis=1,
    237         )
    238 

<__array_function__ internals> in append(*args, **kwargs)

/data.design/analytics/dsshome/code-envs/python/data-disco/lib64/python3.6/site-packages/numpy/lib/function_base.py in append(arr, values, axis)
   4669         values = ravel(values)
   4670         axis = arr.ndim-1
-> 4671     return concatenate((arr, values), axis=axis)
   4672 
   4673 

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

Code comments:

To be verified, but my impression is that the radius argument is not used during the fit phase of NearestNeighbor. If it's the case we can move those lines out of the for loop.
Anw according to my profiling, the main bottleneck is in the query phase with n_neigh.radius_neighbors() that represents 75% of the running time.

Sep 14 '22 10:09 du-phan

One general comment on the API design here: we need to make sure that the behavior feels very similar to that of pandas.join https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html, so we should be using the same terminology, argument name, and return structure as much as possible.

Sep 15 '22 09:09 GaelVaroquaux

I wouldn't bother with a full experimental mechanism: "from dirty_cat.experimental import enable_fuzzy_join": we are not a very mature and stable package, and people know that. I would just write in the docs that the feature is experimental and I would raise a warning when the function is call

Sep 15 '22 11:09 GaelVaroquaux

I don't think that I would do a print_worse_match function for now. Rather do it in the example (and make sure that the return structure of fuzzy join makes this easy)

Sep 15 '22 11:09 GaelVaroquaux

we need to make sure that the behavior feels very similar to that of pandas.join

With this in mind, maybe I would:

rename "keep" to "how" (as in pandas)
rename "match_type" to "how_fuzzy", taking as values "closest_match", and "within_radius"
rename "match_threshold" to "fuzzy_radius"

These are of course very open suggestions and it would be good to gather the view of a variety of people on these choices/

Sep 15 '22 11:09 GaelVaroquaux

You might new a "plt.tight_layout()" before each plt.show() in the example, as the xlabel and ylabel do not appear

Oct 03 '22 14:10 GaelVaroquaux

We're having failing tests that seem unrelated to this PR: https://github.com/dirty-cat/dirty_cat/actions/runs/3175021096/jobs/5176907256

I don't like this: if we have a fragile test suite, we will be chasing more and more failing tests as we add features.

Oct 03 '22 19:10 GaelVaroquaux

It seems that the test failures that we are witnessing are independent from the PR. @jovan-stojanovic : can you confirm?

Yes, this is something that happened for the first time yesterday. With #371, I will try and resolve the min_hash_encoder (and gap_encoder) test so that we don't fetch data online. It is definitely linked to the fact that we need to download data from the internet for some tests.. ( See #376 and #379 )

Oct 04 '22 07:10 jovan-stojanovic

I tried using it, but the problem is that it will not work with our current minimal requirements. It requires the statsmodels package along with seaborn to work.

Let's add statsmodels to our doc dependencies

Oct 10 '22 11:10 GaelVaroquaux

You didn't add the dependency to statsmodels in the right place, I fear: https://app.circleci.com/pipelines/github/dirty-cat/dirty_cat/1032/workflows/83c23fe0-dfaf-4b27-aab6-f6a9f3f87cae/jobs/2300

Oct 10 '22 12:10 GaelVaroquaux

You need to add the dependency here: https://github.com/dirty-cat/dirty_cat/blob/master/build_tools/circle/build_doc.sh#L126

You should remove it to were you have added it previously.

Oct 10 '22 13:10 GaelVaroquaux

Merging! This is a great addition.

Oct 10 '22 13:10 GaelVaroquaux

Congrats!!

Oct 11 '22 18:10 LilianBoulard