fuzzyjoin
Vignette for fuzzy_inner_join
I can't find an example of how match_fun works anywhere online.
Related: are difference_join, distance_join, regex_join, geo_join, etc. wrappers for fuzzy_join? If so, what are their match_funs, and could these be documented somewhere? I saw https://stackoverflow.com/a/40117784/4663008 and it looks like str_detect could be the match_fun for regex_join?
Update: OK, I figured I could do it myself, so I looked in the code and pulled out some of the simpler match_funs :)
- regex():
match_fun <- function(v1, v2) {
  stringr::str_detect(v1, v2)
}
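To make the regex case concrete, here is a minimal sketch of passing that match_fun to fuzzy_inner_join. The data frames `x` and `y` and their column names are made up for illustration. It assumes, based on reading the package source, that match_fun is called with two equal-length vectors (one element per candidate pair) and should return a logical vector.

```r
library(fuzzyjoin)

# Hypothetical example data: free-text entries and regex patterns to look for
x <- data.frame(text = c("apple pie", "banana split", "cherry tart"),
                stringsAsFactors = FALSE)
y <- data.frame(pattern = c("^apple", "tart$"),
                label = c("starts with apple", "ends with tart"),
                stringsAsFactors = FALSE)

# v1 holds values from x$text, v2 the paired values from y$pattern;
# the result marks which pairs count as a match
match_fun <- function(v1, v2) {
  stringr::str_detect(v1, v2)
}

joined <- fuzzy_inner_join(x, y, by = c("text" = "pattern"),
                           match_fun = match_fun)
print(joined)
```

Only "apple pie" and "cherry tart" should survive the join, each paired with the pattern it matches.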
- difference(max_dist):
# max_dist and distance_col come from the enclosing function's arguments
match_fun <- function(v1, v2) {
  dist <- abs(v1 - v2)
  ret <- data.frame(include = (dist <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dist
  }
  ret
}
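Because this match_fun is a closure, it cannot be run on its own: max_dist and distance_col have to exist in an enclosing scope. The sketch below wraps it in a hypothetical factory, `make_difference_match` (my name, not part of fuzzyjoin), and calls the result directly in base R to show the shape of what it returns: a data frame whose first column is the logical match indicator, plus an optional distance column.

```r
# Hypothetical factory that supplies the values the closure expects
make_difference_match <- function(max_dist, distance_col = NULL) {
  function(v1, v2) {
    dist <- abs(v1 - v2)
    ret <- data.frame(include = (dist <= max_dist))
    if (!is.null(distance_col)) {
      ret[[distance_col]] <- dist
    }
    ret
  }
}

match_fun <- make_difference_match(max_dist = 1, distance_col = "diff")

# Three candidate pairs: (1, 1.5), (5, 7), (10, 10)
res <- match_fun(c(1, 5, 10), c(1.5, 7, 10))
print(res)
```

The first and third pairs are within max_dist = 1 and the second is not, so `include` comes back as TRUE, FALSE, TRUE, with the raw differences in `diff`.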
- distance(method, max_dist):
# method, max_dist and distance_col come from the enclosing function's arguments
match_fun <- function(v1, v2) {
  if (method == "euclidean") {
    d <- sqrt(rowSums((v1 - v2) ^ 2))
  } else if (method == "manhattan") {
    d <- rowSums(abs(v1 - v2))
  }
  ret <- dplyr::data_frame(instance = d <= max_dist)
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- d
  }
  ret
}
- stringdist(max_dist, ignore_case, method):
# max_dist, ignore_case, method, distance_col and ... come from the
# enclosing function's arguments
match_fun <- function(v1, v2) {
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  # Shortcut for Levenshtein-like methods: if the difference in string
  # length is greater than the maximum string distance, the edit distance
  # must be at least that large, so the pair cannot match.
  # String length is much faster to compute than string distance.
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    dists <- rep(NA, length(v1))
    dists[include] <- stringdist::stringdist(v1[include], v2[include],
                                             method = method, ...)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method, ...)
  }
  ret <- dplyr::data_frame(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}
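The length shortcut in the stringdist version can be checked in isolation. This is a sketch in base R only: it swaps in utils::adist (generalized Levenshtein distance) for stringdist::stringdist, and the helper `lev` and the sample strings are made up for illustration. It compares the set of matching pairs found with and without the shortcut.

```r
v1 <- c("kitten", "cat", "book")
v2 <- c("sitting", "catastrophe", "back")
max_dist <- 3

# Hypothetical helper: element-wise Levenshtein distance via base R's adist
lev <- function(a, b) {
  mapply(function(s, t) as.vector(utils::adist(s, t)), a, b,
         USE.NAMES = FALSE)
}

# Shortcut: only compute the distance where the length difference allows a match
length_diff <- abs(nchar(v1) - nchar(v2))
include <- length_diff <= max_dist
dists <- rep(NA, length(v1))
dists[include] <- lev(v1[include], v2[include])

# Reference: compute every distance
full <- lev(v1, v2)

# Skipped pairs stay NA in dists, and which() ignores NA,
# so both approaches select the same pairs
print(which(dists <= max_dist))
print(which(full <= max_dist))
```

The "cat" / "catastrophe" pair is never passed to the distance function because its length difference alone already exceeds max_dist.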
- interval_join(...
Here is a simple interval example: https://stackoverflow.com/a/41136551/4663008
match_fun = list(`>=`, `<=`)
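For reference, a minimal sketch of how a list of match_funs like this can be used for an interval-style join. The data frames `points` and `intervals` are invented for illustration. It assumes that an unnamed list is matched to the `by` pairs by position, so the first function compares value with start and the second compares value with end.

```r
library(fuzzyjoin)

# Hypothetical data: point values and the intervals they may fall into
points <- data.frame(id = 1:4, value = c(2, 5, 11, 20))
intervals <- data.frame(name = c("low", "high"),
                        start = c(1, 10),
                        end = c(5, 15),
                        stringsAsFactors = FALSE)

# Keep a pair only when value >= start AND value <= end
joined <- fuzzy_inner_join(
  points, intervals,
  by = c("value" = "start", "value" = "end"),
  match_fun = list(`>=`, `<=`)
)
print(joined)
```

With this data, ids 1 and 2 fall in the "low" interval, id 3 falls in "high", and id 4 matches nothing and is dropped by the inner join.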