fuzzyjoin
fuzzyjoin copied to clipboard
Would be great for geo_join to use k-d trees
geo_join does a full m*n comparison of points. This works for moderately sized datasets but becomes intractable in both time and memory.
The right geo_join() implementation would likely use k-d trees to build a spatial index of one table, then look up points from the other, while taking advantage of the limited max_dist argument. This would turn it (on average) from O(m*n) time to O(m*log(n)) time. (Similar approaches could be used for functions such as distance_join, and on a related note b-k trees could be used for Levenshtein distance).
I may or may not have time to implement this but it would be a top-notch pull request.