address-matching Ran address matching, here's a report

Just ran dedupe on my dataset!

Learning

During the learning step, I labeled 212 examples. Here's what the labels ended up being:

yes: 1, no: 204, unsure: 7

The "yes" example came up in the first three, then there were no more yeses.

After about 30 "no"s, I got a rash of address comparisons where the street number was off by a digit or two, e.g. 5136 S Tripp Ave and 5135 S Tripp Ave.

I wasn't sure whether to research these addresses in order to answer them truthfully, because as many of those might be the same building as not (I now realize that's an assumption that maybe dedupe could have dealt with, by outputting the right "rate" of these kinds of guesses). So I labeled them all "unsure." Then they stopped appearing and I did hundreds of "nos".

Clustering

Total running time: 2:30

Here's the shell output

INFO:dedupe.api:3 folds
INFO:dedupe.crossvalidation:using cross validation to find optimum alpha...
INFO:dedupe.crossvalidation:optimum alpha: 1.000000
INFO:dedupe.api:Learned Weights
INFO:dedupe.api:('address', -0.03654640167951584)
INFO:dedupe.api:('bias', -2.899808406829834)
INFO:dedupe.blocking:Calculating coverage of simple predicates
INFO:dedupe.blocking:Calculating coverage of tf-idf predicates
INFO:dedupe.blocking:defaultdict(<type 'set'>, {})
INFO:dedupe.tfidf:Canopy: TF-IDF:0.4address
INFO:dedupe.tfidf:Canopy: TF-IDF:0.6address
INFO:dedupe.tfidf:Canopy: TF-IDF:0.2address
INFO:dedupe.tfidf:Canopy: TF-IDF:0.8address
INFO:dedupe.blocking:coverage threshold: 32207
INFO:dedupe.blocking:Before removing liberal predicates, 13 predicates
INFO:dedupe.blocking:After removing liberal predicates, 13 predicates
INFO:dedupe.blocking:Final predicate set:
INFO:dedupe.blocking:[('wholeFieldPredicate', 'address')]
INFO:dedupe.blocking:defaultdict(<type 'set'>, {'address': set(['ave', 's', 'st', 'w', 'n'])})
INFO:dedupe.blocking:0, 0.0000812 seconds

...

INFO:dedupe.api:Maximum expected recall and precision
INFO:dedupe.api:recall: 1.000
INFO:dedupe.api:precision: 0.051
INFO:dedupe.api:With threshold: 0.051
clustering...
duplicate sets 368645

Accuracy test

As a quickie accuracy test, I eyeballed the first hundred records of the output csv.

100 of the 100 first records were identical address matches.

So my sample precision is 100%. Who knows what the sample recall is.
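
For reference, the arithmetic behind that statement can be sketched as follows. This is only an illustration: the `precision` helper is a hypothetical name, not something dedupe provides, and the counts are the ones from the eyeball check above.

```python
def precision(true_positives, false_positives):
    """Share of reviewed matches that were judged correct."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no reviewed matches")
    return true_positives / reviewed


# 100 reviewed output records, all judged to be correct matches.
sample_precision = precision(true_positives=100, false_positives=0)
print(sample_precision)
```

Recall would be true positives divided by (true positives + false negatives). The false negatives are the true matches that never made it into the output, so they cannot be counted by reading the output CSV alone.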

Next steps

@fgregg @derekeder, any thoughts on what this means?

Did this perform as expected, given that we were comparing on a single address field?

Next, I'm going to measure how many buildings with no building age data had a match (positive or not).

Mar 26 '14 14:03 jpvelez

Notice that the blocking rule that it learned was "Whole Field Predicate." That means dedupe is only considering records where the address fields completely match. That's why you got the results you did.

You need more positive examples in order to learn a good blocking predicate (about 10 should be fine).

Sometimes you just have a bad training run. Give it another go.

As an aside, 5136 S Tripp Ave and 5135 S Tripp Ave. are almost certainly not the same building since they are on different sides of the street.

If you still can't get a good run, you may need to add some more features to the model. One thing I would consider is adding a feature of whether the trailing digit is odd or even.
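
A minimal sketch of what such an odd/even feature could look like, assuming the street number is the leading run of digits in the address string. The function names are hypothetical and are not part of dedupe; wiring the feature into the model is not shown here.

```python
import re


def street_number_parity(address):
    """Return 'odd' or 'even' for the leading street number, or None if absent."""
    match = re.match(r"\s*(\d+)", address)
    if match is None:
        return None
    return "even" if int(match.group(1)) % 2 == 0 else "odd"


def same_side_of_street(address_a, address_b):
    """True/False when both parities are known, None when either is missing."""
    parity_a = street_number_parity(address_a)
    parity_b = street_number_parity(address_b)
    if parity_a is None or parity_b is None:
        return None
    return parity_a == parity_b


print(same_side_of_street("5136 S Tripp Ave", "5135 S Tripp Ave"))
```

For the two Tripp Ave addresses the parities differ (even vs. odd), which is the signal that they sit on opposite sides of the street.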

Reply to this email directly or view it on GitHub: https://github.com/datamade/address-matching/issues/7

773.888.2718 2231 N. Monticello Ave Chicago, IL 60647

Mar 26 '14 14:03 fgregg

Hmm... actually let me think about this a little more. One thing could be that you are not getting enough true duplicates in the training sample.

What are the sizes of the messy data set and the canonical data set?

Mar 26 '14 15:03 fgregg

canonical dataset (subset of building footprints with addresses; all unique addresses): 481,709 rows

messy dataset (supplemental building ages; all uniquified addresses, started off as unique PINs): 441,909 rows

Let me know if you want me to commit full code / data, they're not up in my repo right now.

Mar 26 '14 15:03 jpvelez

Hmm.. yeah.. with data that size true duplicates are going to be rare. Increase the sample size by an order of magnitude. If that's not enough, do it again.
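
A rough back-of-the-envelope sketch of why duplicates are so rare at this size. It assumes each messy record matches at most one canonical record and that training pairs are drawn uniformly at random; both are simplifications, not a description of how dedupe actually samples. The helper name is hypothetical.

```python
canonical_rows = 481_709
messy_rows = 441_909

# Every messy record could in principle be paired with every canonical record.
total_pairs = canonical_rows * messy_rows

# Assumption: at most one true match per messy record.
max_true_matches = min(canonical_rows, messy_rows)

# Upper bound on the chance that a uniformly random pair is a true match.
max_match_rate = max_true_matches / total_pairs


def expected_true_pairs(sample_size):
    """Upper bound on the expected number of true matches in a random sample."""
    return sample_size * max_match_rate


print(total_pairs, max_match_rate)
```

Under these assumptions the match rate is at most 1 in 481,709, and the expected number of true pairs grows linearly with the sample size, so a tenfold larger sample gives roughly ten times as many duplicates to label.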

Mar 26 '14 15:03 fgregg

And what am I looking for when I do that? More true duplicates to label?

What additional blocking predicate might it learn besides matching the entire field, and how would that help? (i.e. is there any way to TELL when addresses are part of the same building?)

Mar 26 '14 15:03 jpvelez

You are looking for more true duplicates to label.

The predicate that you learned is the "whole field predicate," which means dedupe will only compare records that have identical address fields. Think about why that's a problem. To see what other predicates are possible, check out https://github.com/datamade/dedupe/blob/master/dedupe/predicates.py
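
To make the problem concrete, here is a small sketch of blocking in general. It is not dedupe's implementation: `whole_field_key`, `first_token_key`, and `candidate_pairs` are hypothetical stand-ins, and the second and third addresses are made up for illustration. Only records that share a block key are ever compared.

```python
from collections import defaultdict
from itertools import combinations


def whole_field_key(address):
    # Block key is the entire field value, so only identical strings share a block.
    return address


def first_token_key(address):
    # Hypothetical looser key: the first whitespace-separated token (the street number).
    tokens = address.split()
    return tokens[0] if tokens else ""


def candidate_pairs(records, key_func):
    """Group record ids by block key and return the id pairs that would be compared."""
    blocks = defaultdict(list)
    for record_id, address in records.items():
        blocks[key_func(address)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: "5136 S Tripp Ave",
    2: "5136 S. Tripp Avenue",  # same place, formatted differently
    3: "5135 S Tripp Ave",
}

print(candidate_pairs(records, whole_field_key))
print(candidate_pairs(records, first_token_key))
```

With the whole-field key, no two of these records share a block, so records 1 and 2 are never compared even though they differ only in formatting. The looser key puts them in the same block, where the learned model can then score them.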

How can you tell when addresses are part of the same building... that's not a technical question.

Mar 26 '14 15:03 fgregg