dedupe
index predicates always used by labeler
Building the canopy index when setting up the sample for labelling often takes a considerable amount of time. For instance, this step takes over 45 minutes on a 20k sample using the variable spec below for deduping.
When I specify index_predicates = False, I would expect that choice to apply to the active learning stage as well, but it doesn't; it is only applied to the final training step. Is this a bug, or is there some rationale behind it?
Or, perhaps there's a better way to set up the variables?
[
{'field':'first_name', 'variable name':'first_name', 'type':'String', 'has_missing':True},
{'field':'last_name', 'variable name':'last_name', 'type':'String', 'has_missing':True},
{'field':'long_addr', 'variable name':'long_addr', 'type':'String', 'has_missing':True},
{'field':'postcode', 'variable name':'postcode', 'type':'String', 'has_missing':True},
{'field':'country', 'variable name':'country', 'type':'String', 'has_missing':True},
{'field':'email', 'variable name':'email', 'type':'String', 'has_missing':True},
{'field':'phone', 'variable name':'phone', 'type':'String', 'has_missing':True},
{'type': 'Interaction', 'interaction variables': ['country', 'last_name']},
{'type': 'Interaction', 'interaction variables': ['country', 'postcode']},
{'type': 'Interaction', 'interaction variables': ['country', 'phone']}
]
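
One variation that might be worth trying, as a sketch only: dedupe's documentation describes a ShortString type that behaves like String except that no index blocking rules are learned for those fields. The snippet below rebuilds the spec above with some fields switched to ShortString. The helper build_variables and the choice of which fields count as "short" are assumptions made for illustration, and whether this actually shortens the canopy index step for the labeler has not been measured here.

```python
# Field names taken from the spec above.
FIELDS = ['first_name', 'last_name', 'long_addr', 'postcode',
          'country', 'email', 'phone']

# Assumption: these fields are short enough to go without index predicates.
SHORT_FIELDS = {'first_name', 'last_name', 'postcode', 'country', 'phone'}


def build_variables(fields, short_fields):
    """Hypothetical helper: build a dedupe-style variable definition list."""
    variables = []
    for name in fields:
        var_type = 'ShortString' if name in short_fields else 'String'
        variables.append({
            'field': name,
            'variable name': name,
            'type': var_type,
            'has_missing': True,
        })
    # Same three interactions as in the original spec.
    for other in ('last_name', 'postcode', 'phone'):
        variables.append({
            'type': 'Interaction',
            'interaction variables': ['country', other],
        })
    return variables


variables = build_variables(FIELDS, SHORT_FIELDS)
```

With this split, only long_addr and email remain plain String fields, so they would be the only ones for which index predicates are considered.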