csvlink latlong comparator failing
Hi,
Attempting to link two CSV files and the latlong comparator is failing because the fields are being treated as strings.
Error:
INFO:root:taking a sample of 150000 possible pairs
Traceback (most recent call last):
File "/usr/local/bin/csvlink", line 11, in <module>
sys.exit(launch_new_instance())
File "/usr/local/lib/python3.6/site-packages/csvdedupe/csvlink.py", line 210, in launch_new_instance
d.main()
File "/usr/local/lib/python3.6/site-packages/csvdedupe/csvlink.py", line 134, in main
deduper.sample(nonexact_1, nonexact_2, self.sample_size)
File "/usr/local/lib/python3.6/site-packages/dedupe/api.py", line 849, in sample
original_length_2)
File "/usr/local/lib/python3.6/site-packages/dedupe/labeler.py", line 321, in sample_product
sample_size)
File "/usr/local/lib/python3.6/site-packages/dedupe/labeler.py", line 67, in sample_product
deque_2)
File "/usr/local/lib/python3.6/site-packages/dedupe/sampling.py", line 23, in blockedSample
*args))
File "/usr/local/lib/python3.6/site-packages/dedupe/sampling.py", line 122, in linkSamplePredicates
yield linkSamplePredicate(subsample_size, predicate, items1, items2)
File "/usr/local/lib/python3.6/site-packages/dedupe/sampling.py", line 144, in linkSamplePredicate
block_keys = predicate_function(column)
File "/usr/local/lib/python3.6/site-packages/dedupe/predicates.py", line 422, in latLongGridPredicate
return (str([round(dim, digits) for dim in field]),)
File "/usr/local/lib/python3.6/site-packages/dedupe/predicates.py", line 422, in <listcomp>
return (str([round(dim, digits) for dim in field]),)
TypeError: type str doesn't define __round__ method
Config:
"field_names": ["Account_Name", "Mailing_Street", "Mailing_Zip", "Mailing_Country", "Mailing_City", "Mailing_State", "Entity_Legal_Name", "Australian_Business_Number", "Geolocation"],
"field_definition" : [
    {"field" : "Account_Name", "type" : "String"},
    {"field" : "Mailing_Street", "type" : "String", "Has Missing" : true},
    {"field" : "Mailing_Zip", "type" : "String", "Has Missing" : true},
    {"field" : "Mailing_City", "type" : "String"},
    {"field" : "Mailing_State", "type" : "String"},
    {"field" : "Mailing_Country", "type" : "Exact"},
    {"field" : "Entity_Legal_Name", "type" : "Exact", "Has Missing" : true},
    {"field" : "Geolocation", "type" : "LatLong"},
    {"field" : "Australian_Business_Number", "type" : "String", "Has Missing" : true}],
"output_file": "output.csv",
"skip_training": false,
"training_file": "training.json",
"sample_size": 150000,
"recall_weight": 2
}
Data in csv looks like: (-37.985132, 145.214008)
@fgregg hopefully this is still maintained!
Fantastic package and hugely helpful
I'm getting a similar error when passing LatLong as a field from a CSV. In my case it's a dedupe job within a single CSV rather than a link between two files.
Running dedupe 1.8.1 with Python 3.6 on macOS.
File "dedupe-try.py", line 131, in <module>
deduper.sample(data_d, 15000) #To train dedupe, we feed it a sample of records.
File "//anaconda/lib/python3.6/site-packages/dedupe/api.py", line 806, in sample
self.active_learner.sample_combo(data, blocked_proportion, sample_size)
File "//anaconda/lib/python3.6/site-packages/dedupe/labeler.py", line 151, in sample_combo
super(RLRLearner, self).sample_combo(*args)
File "//anaconda/lib/python3.6/site-packages/dedupe/labeler.py", line 38, in sample_combo
data)
File "//anaconda/lib/python3.6/site-packages/dedupe/sampling.py", line 23, in blockedSample
*args))
File "//anaconda/lib/python3.6/site-packages/dedupe/sampling.py", line 62, in dedupeSamplePredicates
items)
File "//anaconda/lib/python3.6/site-packages/dedupe/sampling.py", line 81, in dedupeSamplePredicate
block_keys = predicate_function(column)
File "//anaconda/lib/python3.6/site-packages/dedupe/predicates.py", line 406, in latLongGridPredicate
return (str([round(dim, digits) for dim in field]),)
File "//anaconda/lib/python3.6/site-packages/dedupe/predicates.py", line 406, in <listcomp>
return (str([round(dim, digits) for dim in field]),)
TypeError: type str doesn't define __round__ method
Did anyone come up with a solution to this? I've tried storing my location column in the CSV as:
"123.45,-123.45"
"(123.45,-123.45)"
"[123.45,-123.45]"
... none of which work.
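I was able to solve this problem by making each value in the LatLong column a tuple of floats rather than a string, i.e., setting the values to (123.45, 123.45). The traceback above fails because round() is being called on string elements, so the values have to be numeric by the time dedupe sees them.
You can see this example in the dedupe docs.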
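
For anyone landing here later, below is a minimal sketch of that conversion when loading the CSV yourself in Python. The helper name `parse_latlong` is hypothetical (it is not part of dedupe or csvdedupe), the sample rows are made up for illustration, and it assumes the column is called Geolocation and holds text like "(-37.985132, 145.214008)". It also assumes that empty cells should become None, which is how I understand dedupe expects missing values, so treat that as an assumption too.

```python
import csv
import io


def parse_latlong(value):
    """Hypothetical helper: convert '(lat, long)' text into a (float, float) tuple.

    Accepts '(1.0, 2.0)', '[1.0, 2.0]' or '1.0, 2.0'. Returns None for
    empty or missing values.
    """
    if value is None:
        return None
    # Drop surrounding whitespace and any enclosing brackets.
    text = value.strip().strip("()[]").strip()
    if not text:
        return None
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2:
        raise ValueError("expected 'lat, long', got %r" % value)
    return (float(parts[0]), float(parts[1]))


# Made-up sample rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "Account_Name,Geolocation\n"
    'Acme Pty Ltd,"(-37.985132, 145.214008)"\n'
    "No Location Co,\n"
)

# Build the record dictionary, replacing the Geolocation string
# with a tuple of floats (or None when the cell is empty).
data_d = {}
for row_id, row in enumerate(csv.DictReader(sample_csv)):
    record = dict(row)
    record["Geolocation"] = parse_latlong(record["Geolocation"])
    data_d[row_id] = record

print(data_d[0]["Geolocation"])
print(data_d[1]["Geolocation"])
```

With the values converted this way, round() receives floats instead of strings. This only covers the case where you build the records in your own script; whether csvlink itself can be made to do the same conversion is not something this sketch addresses.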