
index predicates always used by labeler

Open amanderson opened this issue 6 years ago • 0 comments

Building the canopy index while setting up the sample for labelling often takes a considerable amount of time. For instance, this step is taking over 45 minutes on a 20k sample with the deduping variable spec below.

When I specify index_predicates=False, I would expect that choice to apply to the active learning stage as well, but it doesn't: it is only applied to the final training step. Is this a bug, or is there some rationale behind it?

Or perhaps there's a better way to set up the variables?

[
    {'field': 'first_name', 'variable name': 'first_name', 'type': 'String', 'has_missing': True},
    {'field': 'last_name',  'variable name': 'last_name',  'type': 'String', 'has_missing': True},
    {'field': 'long_addr',  'variable name': 'long_addr',  'type': 'String', 'has_missing': True},
    {'field': 'postcode',   'variable name': 'postcode',   'type': 'String', 'has_missing': True},
    {'field': 'country',    'variable name': 'country',    'type': 'String', 'has_missing': True},
    {'field': 'email',      'variable name': 'email',      'type': 'String', 'has_missing': True},
    {'field': 'phone',      'variable name': 'phone',      'type': 'String', 'has_missing': True},
    {'type': 'Interaction', 'interaction variables': ['country', 'last_name']},
    {'type': 'Interaction', 'interaction variables': ['country', 'postcode']},
    {'type': 'Interaction', 'interaction variables': ['country', 'phone']}
]
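
For reference, the same spec can be generated from a list of field names, which makes it easier to try out smaller variants when timing the index build. This is only a minimal sketch: `build_variables`, `FIELDS` and `INTERACTIONS` are hypothetical names of my own, not part of dedupe, and the sketch assumes every field is a `String` with `has_missing` set to `True`, as in the list above.

```python
def build_variables(fields, interactions):
    """Build a variable spec like the one above (hypothetical helper, not a dedupe API)."""
    # One String variable per field; the variable name mirrors the field name.
    variables = [
        {'field': name, 'variable name': name, 'type': 'String', 'has_missing': True}
        for name in fields
    ]
    # Interaction variables refer to the variable names defined above.
    variables += [
        {'type': 'Interaction', 'interaction variables': list(pair)}
        for pair in interactions
    ]
    return variables


FIELDS = ['first_name', 'last_name', 'long_addr', 'postcode',
          'country', 'email', 'phone']
INTERACTIONS = [('country', 'last_name'),
                ('country', 'postcode'),
                ('country', 'phone')]

variables = build_variables(FIELDS, INTERACTIONS)
```

Dropping entries from `FIELDS` or `INTERACTIONS` gives a reduced spec, which may help narrow down which variables dominate the index build time.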

amanderson, Jul 18 '19 09:07