holoclean Added EM iterations to repair process, allow multiple init values, and select best init value as current via co-occurrence probability

This PR introduces EM iterations to the repair process and adds support for multiple init values:

created separate column for init values (1 or more) and current value (singular value, old 'init_value')
all featurizers have been changed to reference current_value and renamed from e.g. InitFeaturizer to CurrentFeaturizer
update current_values in cell_domain with inferred values from inf_vals_dom
re-run featurization + training + inference with the new current_values (featurizers such as CurrentAttrFeaturizer or CurrentXFeaturizer can take advantage of the updated current values)
fixed a bug in InitSimFeaturizer where it wasn't computing the similarity metrics correctly between the init_value and values in the domain
fixed a bug where we weren't properly detecting NULL values in NullDetector
current_value is initialized with the value from init_values with the highest sum of co-occurrence probabilities with the other init_values in the tuple
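
A minimal sketch of how current_value could be picked from several init values, as described in the last point. This is not the code in the PR: `build_cooccur_prob` and `pick_current_value` are hypothetical names, and the probabilities are estimated here as plain conditional frequencies over a toy table.

```python
from collections import Counter

def build_cooccur_prob(rows):
    """Return a function estimating P(attr=val | other_attr=other_val) from row counts.

    rows: list of dicts mapping attribute name -> single value (toy input, an assumption).
    """
    single = Counter()
    pair = Counter()
    for row in rows:
        for a, v in row.items():
            single[(a, v)] += 1
            for b, w in row.items():
                if a != b:
                    pair[(a, v, b, w)] += 1

    def prob(attr, val, other_attr, other_val):
        denom = single[(other_attr, other_val)]
        return pair[(attr, val, other_attr, other_val)] / denom if denom else 0.0

    return prob

def pick_current_value(attr, candidates, other_init_values, prob):
    """Pick the candidate with the highest summed co-occurrence probability.

    other_init_values: dict of other attribute -> list of init values in the same tuple.
    """
    def score(val):
        return sum(prob(attr, val, o_attr, o_val)
                   for o_attr, o_vals in other_init_values.items()
                   for o_val in o_vals)
    return max(candidates, key=score)  # ties resolve to the first candidate

rows = [
    {"city": "NYC", "state": "NY"},
    {"city": "NYC", "state": "NY"},
    {"city": "Boston", "state": "MA"},
]
prob = build_cooccur_prob(rows)
chosen = pick_current_value("city", ["Boston", "NYC"], {"state": ["NY"]}, prob)
print(chosen)
```

Here "NYC" wins because it co-occurs with state "NY" in the toy rows while "Boston" never does.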

I've tested this with 3 iterations with the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.

INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
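
For reference, a rough sketch of the iteration loop described above, with detection left outside the loop as in this PR. `run_em` and `majority_infer` are hypothetical; the `infer` callable stands in for the whole featurization + training + inference step and is not HoloClean's actual API.

```python
from collections import Counter

def run_em(current_values, infer, max_iters=3):
    """Feed inferred values back in as current values for up to max_iters rounds.

    current_values: dict of cell id -> value.
    infer: callable taking current values and returning dict of cell id -> inferred value.
    Returns the final values and the number of cells changed in each iteration.
    """
    changes_per_iter = []
    for _ in range(max_iters):
        inferred = infer(current_values)
        updated = {cell: inferred.get(cell, val) for cell, val in current_values.items()}
        n_changed = sum(1 for cell in updated if updated[cell] != current_values[cell])
        current_values = updated
        changes_per_iter.append(n_changed)
        if n_changed == 0:  # nothing moved, so further iterations would repeat
            break
    return current_values, changes_per_iter

def majority_infer(values):
    """Toy stand-in for inference: predict the most common value for every cell."""
    top, _ = Counter(values.values()).most_common(1)[0]
    return {cell: top for cell in values}

final, changes = run_em({0: "a", 1: "a", 2: "b"}, majority_infer)
print(final, changes)
```

With the toy inference step the loop stops at the second iteration, once no cell changes.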

Nov 22 '18 02:11 richardwu

How on earth are F1 and Recall > 1.0? See Repairing F1 and Repairing Recall.

Nov 22 '18 17:11 thodrek

Pushed up a patch to update single/co-occur stats after each EM iteration for OccurFeaturizer. Interestingly enough, in the second iteration our recall goes up but our precision goes down (since we are making more repairs):

// After iteration 1
INFO:root:Precision = 0.93, Recall = 0.68, Repairing Recall = 0.76, F1 = 0.79, Repairing F1 = 0.84, Detected Errors = 458, Total Errors = 509, Correct Repairs = 347, Total Repairs = 372, Total Repairs (clean data) = 372

// After iteration 2
INFO:root:Precision = 0.89, Recall = 0.71, Repairing Recall = 0.79, F1 = 0.79, Repairing F1 = 0.83, Detected Errors = 458, Total Errors = 509, Correct Repairs = 361, Total Repairs = 407, Total Repairs (clean data) = 407
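
To make the trade-off concrete, here is a small sketch that recomputes the metrics from the counts in the two log lines above. It assumes the usual definitions (precision = correct repairs / total repairs, recall = correct repairs / total errors, repairing recall = correct repairs / detected errors); `repair_metrics` is a hypothetical helper, not the function in evaluate/eval.py.

```python
def repair_metrics(correct_repairs, total_repairs, total_errors, detected_errors):
    """Recompute precision/recall/F1 from raw repair counts (assumed definitions)."""
    def ratio(num, den):
        return num / den if den else 0.0

    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    precision = ratio(correct_repairs, total_repairs)
    recall = ratio(correct_repairs, total_errors)
    repairing_recall = ratio(correct_repairs, detected_errors)
    return {
        "precision": precision,
        "recall": recall,
        "repairing_recall": repairing_recall,
        "f1": f1(precision, recall),
        "repairing_f1": f1(precision, repairing_recall),
    }

# Counts taken from the log lines for iterations 1 and 2.
iter1 = repair_metrics(347, 372, 509, 458)
iter2 = repair_metrics(361, 407, 509, 458)
for name, m in (("iteration 1", iter1), ("iteration 2", iter2)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Rounded to two decimals these match the logged values: 14 more correct repairs lift recall, while 35 more total repairs pull precision down.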

I attempted to do more iterations but there is an issue with how we use Pools where we allocate a new pool of workers every time. I'll fix this in a separate PR.

Nov 22 '18 17:11 richardwu

@thodrek I forgot to update current_value to init_values for total_repairs_clean. I've since fixed it (https://github.com/HoloClean/holoclean/blob/32ae5efc567dcf839d4310a87e81a71ee002b9d0/evaluate/eval.py#L164).

Nov 22 '18 17:11 richardwu

Sounds good.

Nov 22 '18 17:11 thodrek

Newest results with this patch, including the fix to InitAttrFeaturizer (now called CurrentAttrFeaturizer):

INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222

Latest changes:

multiple initial values in raw dataset (values separated by '|||') work now
current_stats=True will enable statistics to be re-collected on new current values after each EM iteration
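
As an illustration of the multi-value format, a raw cell could be split roughly like this. Only the '|||' separator comes from this PR; `INIT_SEP` and `split_init_values` are hypothetical names, and the real loader may handle NULLs and whitespace differently.

```python
INIT_SEP = "|||"  # separator between multiple init values in the raw dataset

def split_init_values(raw):
    """Split a raw cell string into its list of init values.

    A cell without the separator yields a single-element list.
    """
    return raw.split(INIT_SEP)

print(split_init_values("birmingham|||birmingxam"))
print(split_init_values("birmingham"))
```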

Ready for another review 👀

Nov 24 '18 02:11 richardwu