
Adds stringsim_join functions as requested in #71

Open JBGruber opened this issue 5 years ago • 2 comments

I love the fuzzyjoin package, and today I wanted to learn in more detail how it works. By coincidence, I stumbled across #71 and thought implementing it would be a good way to understand the package's internals. Feel free to reject this: it was mainly a practice exercise that turned out better than I expected.

The PR is still lacking some tests, but I wanted to check first whether you are interested in adding these functions.

For me, the main reason to work with similarities instead of distances is that they are standardized between 0 and 1 (at least for most methods). I usually work with longer texts of heterogeneous lengths. Newspaper articles, for example, vary significantly in length, so trying to find duplicates based on distance alone is basically impossible.
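To illustrate the point about standardization, here is a minimal sketch in Python rather than R. It is not the code from this PR. It assumes a Levenshtein distance and a similarity defined as 1 minus the distance divided by the length of the longer string; the helper names `levenshtein` and `similarity` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Two unrelated short strings and two nearly identical long strings.
short_a, short_b = "cat", "dog"
long_a, long_b = "x" * 100, "x" * 97 + "yyy"

for a, b in [(short_a, short_b), (long_a, long_b)]:
    print(levenshtein(a, b), round(similarity(a, b), 2))
```

Both pairs have the same raw distance, so a single distance cutoff cannot tell them apart, while the normalized similarity can.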

JBGruber, Oct 19 '20 16:10

Very nice, hopefully it will be merged into the main branch! Thank you.

emilBeBri, Oct 30 '20 09:10