fuzzyjoin icon indicating copy to clipboard operation
fuzzyjoin copied to clipboard

Would be great for geo_join to use k-d trees

Open dgrtwo opened this issue 7 years ago • 0 comments

geo_join does a full m*n comparison of points. This works for moderately sized datasets but becomes intractable in both time and memory.

The right geo_join() implementation would likely use k-d trees to build a spatial index of one table, then look up points from the other, while taking advantage of the limited max_dist argument. This would turn it (on average) from O(m*n) time to O(m*log(n)) time. (Similar approaches could be used for functions such as distance_join, and on a related note b-k trees could be used for Levenshtein distance).

I may or may not have time to implement this but it would be a top-notch pull request.

dgrtwo avatar Feb 08 '18 18:02 dgrtwo