Vignette for fuzzy_inner_join

Open andrewnguyen42 opened this issue 9 years ago • 1 comments

I can't find an example of how match_fun works anywhere online.

andrewnguyen42, Jan 23 '17 15:01

Related: are difference_join, distance_join, regex_join, geo_join, etc. wrappers for fuzzy_join? If so, what are their match_funs, and could these be documented somewhere? I saw https://stackoverflow.com/a/40117784/4663008 and it looks like str_detect could be the match_fun for regex_join?

Update: OK, I figured I could do it myself, so I looked in the code and pulled out some of the simpler match_funs :)
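For orientation, here is a minimal sketch of the shape these functions share, as read from the snippets that follow: a match_fun takes two equal-length vectors of candidate pairs and reports, element by element, whether each pair matches. In those snippets it returns either a logical vector or a data frame holding the match flag plus an optional distance column. The name `within_one` and the toy data are made up for illustration, and the fuzzyjoin call at the end runs only if the package is installed.

```r
# Hypothetical match_fun: two values match when they differ by at most 1.
# It is vectorized: v1 and v2 have equal length, one element per candidate pair.
within_one <- function(v1, v2) {
  abs(v1 - v2) <= 1
}

# Base-R illustration: build every (x, y) candidate pair by hand,
# then keep the pairs the match_fun accepts.
pairs <- expand.grid(x = c(1, 5, 10), y = c(2, 9), KEEP.OUT.ATTRS = FALSE)
pairs$match <- within_one(pairs$x, pairs$y)
matched <- pairs[pairs$match, c("x", "y")]
print(matched)

# With fuzzyjoin installed, the same function can be passed as match_fun.
if (requireNamespace("fuzzyjoin", quietly = TRUE)) {
  a <- data.frame(x = c(1, 5, 10))
  b <- data.frame(y = c(2, 9))
  print(fuzzyjoin::fuzzy_inner_join(a, b, by = c("x" = "y"),
                                    match_fun = within_one))
}
```

The hand-built pairing is only there to show what the function is asked; it is not a claim about how fuzzy_join enumerates pairs internally.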

  1. regex():
  match_fun <- function(v1, v2) {
    stringr::str_detect(v1, v2)
  }
  2. difference(max_dist):
  match_fun <- function(v1, v2) {
    dist <- abs(v1 - v2)
    ret <- data.frame(include = (dist <= max_dist))
    if (!is.null(distance_col)) {
      ret[[distance_col]] <- dist
    }
    ret
  }
  3. distance(method, max_dist):
  match_fun <- function(v1, v2) {
    if (method == "euclidean") {
      d <- sqrt(rowSums((v1 - v2) ^ 2))
    } else if (method == "manhattan") {
      d <- rowSums(abs(v1 - v2))
    }
    ret <- dplyr::data_frame(instance = d <= max_dist)
    if (!is.null(distance_col)) {
      ret[[distance_col]] <- d
    }
    ret
  }
  4. stringdist(max_dist, ignore_case, method):
  match_fun <- function(v1, v2) {
    if (ignore_case) {
      v1 <- stringr::str_to_lower(v1)
      v2 <- stringr::str_to_lower(v2)
    }

    # shortcut for Levenshtein-like methods: if the difference in
    # string length is greater than the maximum string distance, the
    # edit distance must be at least that large

    # length is much faster to compute than string distance
    if (method %in% c("osa", "lv", "dl")) {
      length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
      include <- length_diff <= max_dist

      dists <- rep(NA, length(v1))

      dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method, ...)
    } else {
      # have to compute them all
      dists <- stringdist::stringdist(v1, v2, method = method, ...)
    }
    ret <- dplyr::data_frame(include = (dists <= max_dist))
    if (!is.null(distance_col)) {
      ret[[distance_col]] <- dists
    }
    ret
  }
  5. interval_join(...

Here is a simple interval example: https://stackoverflow.com/a/41136551/4663008

match_fun = list(`>=`, `<=`)
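To make the list form concrete, here is a small sketch of an interval-style join in which each entry of `by` gets its own comparison function. The data frames `events` and `windows` and their column names are hypothetical, not taken from the linked answer. The base-R part computes the expected pairs by hand, and the fuzzyjoin call is guarded so the block still runs without the package.

```r
# Hypothetical data: point-in-time events and labelled windows.
events  <- data.frame(event = c("a", "b", "c"), t = c(3, 12, 25))
windows <- data.frame(label = c("early", "late"),
                      start = c(0, 10), end = c(5, 20))

# Base-R check: cross join every event with every window,
# then keep the pairs where start <= t <= end.
pairs  <- merge(events, windows, by = NULL)
inside <- pairs[pairs$t >= pairs$start & pairs$t <= pairs$end, ]
print(inside)

# With fuzzyjoin installed: one function per by-pair, applied in order,
# so t >= start and t <= end must both hold for a row pair to match.
if (requireNamespace("fuzzyjoin", quietly = TRUE)) {
  joined <- fuzzyjoin::fuzzy_inner_join(
    events, windows,
    by = c("t" = "start", "t" = "end"),
    match_fun = list(`>=`, `<=`)
  )
  print(joined)
}
```

Event "c" falls outside both windows, so an inner join drops it; a left join would keep it with missing window columns.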

ahcyip, Jun 06 '17 02:06