Add EM iterations to the repair process, allow multiple init values, and select the best init value as current via co-occurrence probability
This PR introduces EM iterations to the repair process and adds support for multiple init values:
- created separate columns for the init values (1 or more) and the current value (a single value, the old `init_value`)
- all featurizers now reference `current_value` and have been renamed accordingly, e.g. `InitFeaturizer` to `CurrentFeaturizer`
- after every iteration, update the `current_value`s in `cell_domain` with the inferred values from `inf_vals_dom`
- re-run featurization + training + inference with the new `current_value`s, so that featurizers such as `CurrentAttrFeaturizer` or `CurrentXFeaturizer` can take advantage of the updated current values
- fixed a bug in `InitSimFeaturizer` where it wasn't computing the similarity metrics correctly between the `init_value` and the values in the domain
- fixed a bug where we weren't properly detecting `NULL` values in `NullDetector`
- `current_value` is initialized with the value from `init_values` that has the highest sum of co-occurrence probabilities with the other `init_values` in the tuple
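
To make the last bullet concrete, here is a minimal sketch of picking a current value by summed co-occurrence probability. It is an illustration only, not the code in this PR: `build_cooccur_probs`, `pick_current_value`, and the dict-based row format are hypothetical names and assumptions, and the probabilities are estimated as simple conditional frequencies over a small reference table.

```python
from collections import Counter


def build_cooccur_probs(rows):
    """Estimate P(attr=val | other_attr=other_val) from a list of dict rows.

    Hypothetical helper: rows are assumed to hold one value per attribute.
    """
    single = Counter()  # (attr, val) -> count
    pair = Counter()    # (attr, val, other_attr, other_val) -> count
    for row in rows:
        for attr, val in row.items():
            single[(attr, val)] += 1
            for other_attr, other_val in row.items():
                if other_attr != attr:
                    pair[(attr, val, other_attr, other_val)] += 1

    def prob(attr, val, other_attr, other_val):
        denom = single[(other_attr, other_val)]
        if denom == 0:
            return 0.0
        return pair[(attr, val, other_attr, other_val)] / denom

    return prob


def pick_current_value(attr, tuple_init_values, prob):
    """Return the init value of `attr` with the highest summed co-occurrence
    probability against the init values of every other attribute in the tuple.

    tuple_init_values: dict mapping attribute -> list of init values.
    """
    def score(val):
        return sum(
            prob(attr, val, other_attr, other_val)
            for other_attr, other_vals in tuple_init_values.items()
            if other_attr != attr
            for other_val in other_vals
        )

    return max(tuple_init_values[attr], key=score)


# Toy reference data (made up for illustration).
rows = [
    {"city": "Chicago", "state": "IL"},
    {"city": "Chicago", "state": "IL"},
    {"city": "Chicago", "state": "IN"},
    {"city": "Gary", "state": "IN"},
]
prob = build_cooccur_probs(rows)
tuple_init_values = {"city": ["Chicago"], "state": ["IL", "IN"]}
chosen = pick_current_value("state", tuple_init_values, prob)
```

In this toy example `IL` scores 2/3 and `IN` scores 1/3 against `city = Chicago`, so `IL` becomes the current value. Ties fall to the first candidate listed, which is a property of `max` rather than anything stated in the PR.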
I've tested this with 3 iterations with the hospital dataset. On the second iteration we see an improvement in recall (with a slight hit to precision) due to the increased number of repairs made. It seems to converge after the 2nd iteration.
INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219
INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222
INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222
NB: this PR does not currently include the detection process in the EM iterations: this might be worth considering.
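
For readers skimming the description, the iteration structure can be sketched roughly as follows. This is a simplified stand-in, not the PR's code: `run_em_iterations` and `toy_repair_step` are hypothetical names, the "repair step" here is a trivial majority vote instead of featurization + training + inference, and the early stop on unchanged values is an assumption added for the sketch.

```python
from collections import Counter


def run_em_iterations(current_values, repair_step, max_iters=3):
    """Repeatedly run a repair step and fold its inferred values back
    into the current values.

    current_values: dict mapping cell id -> current value.
    repair_step: callable taking the current values and returning a dict of
        cell id -> inferred value (stands in for featurize + train + infer).
    Returns the final current values and the number of iterations run.
    """
    iterations_run = 0
    for _ in range(max_iters):
        iterations_run += 1
        inferred = repair_step(current_values)
        updated = {**current_values, **inferred}
        if updated == current_values:
            break  # nothing changed, so further iterations would be no-ops
        current_values = updated
    return current_values, iterations_run


def toy_repair_step(current_values):
    """Toy stand-in: repair every cell to the most frequent value."""
    most_common, _ = Counter(current_values.values()).most_common(1)[0]
    return {
        cell: most_common
        for cell, val in current_values.items()
        if val != most_common
    }


cells = {"t1.state": "IL", "t2.state": "IL", "t3.state": "Il"}
final_values, n_iters = run_em_iterations(cells, toy_repair_step)
```

Here the first pass repairs `t3.state` and the second pass changes nothing, so the loop stops after 2 iterations.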
How on earth are F1 and Recall > 1.0? See repairing F1 and repairing recall.
Pushed up a patch to update the single/co-occur stats after each EM iteration for `OccurFeaturizer`. Interestingly enough, on the second iteration our recall goes up but our precision goes down (since we are making more repairs):
// After iteration 1
INFO:root:Precision = 0.93, Recall = 0.68, Repairing Recall = 0.76, F1 = 0.79, Repairing F1 = 0.84, Detected Errors = 458, Total Errors = 509, Correct Repairs = 347, Total Repairs = 372, Total Repairs (clean data) = 372
// After iteration 2
INFO:root:Precision = 0.89, Recall = 0.71, Repairing Recall = 0.79, F1 = 0.79, Repairing F1 = 0.83, Detected Errors = 458, Total Errors = 509, Correct Repairs = 361, Total Repairs = 407, Total Repairs (clean data) = 407
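
For reference, the logged metrics are consistent with the usual ratio definitions applied to the logged counts. The sketch below is inferred from the numbers in these log lines, not copied from `eval.py`; `repair_metrics` is a hypothetical helper.

```python
def repair_metrics(correct_repairs, total_repairs, total_errors, detected_errors):
    """Compute the logged metrics from raw counts.

    Assumed definitions (they reproduce the log lines in this thread):
      precision        = correct repairs / total repairs
      recall           = correct repairs / total errors
      repairing recall = correct repairs / detected errors
    F1 and repairing F1 are the harmonic means of precision with each recall.
    """
    def ratio(num, denom):
        return num / denom if denom else 0.0

    def harmonic_mean(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    precision = ratio(correct_repairs, total_repairs)
    recall = ratio(correct_repairs, total_errors)
    repairing_recall = ratio(correct_repairs, detected_errors)
    return {
        "precision": precision,
        "recall": recall,
        "repairing_recall": repairing_recall,
        "f1": harmonic_mean(precision, recall),
        "repairing_f1": harmonic_mean(precision, repairing_recall),
    }


# Counts taken from the two log lines in this comment.
iter1 = {k: round(v, 2) for k, v in repair_metrics(347, 372, 509, 458).items()}
iter2 = {k: round(v, 2) for k, v in repair_metrics(361, 407, 509, 458).items()}
```

With these definitions, repairing recall can never be lower than recall, because detected errors (458) are a subset of total errors (509).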
I attempted to do more iterations but there is an issue with how we use Pools where we allocate a new pool of workers every time. I'll fix this in a separate PR.
@thodrek I forgot to update current_value to init_values for total_repairs_clean. I've since fixed it (https://github.com/HoloClean/holoclean/blob/32ae5efc567dcf839d4310a87e81a71ee002b9d0/evaluate/eval.py#L164).
Sounds good.
Newest results with this patch, including the fix to `InitAttrFeaturizer` (now called `CurrentAttrFeaturizer`):
INFO:root:Precision = 1.00, Recall = 0.43, Repairing Recall = 0.48, F1 = 0.60, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 219, Total Repairs = 219, Total Repairs (clean data) = 219
INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222
INFO:root:Precision = 1.00, Recall = 0.44, Repairing Recall = 0.48, F1 = 0.61, Repairing F1 = 0.65, Detected Errors = 458, Total Errors = 509, Correct Repairs = 222, Total Repairs = 222, Total Repairs (clean data) = 222
Latest changes:
- multiple initial values in the raw dataset (values separated by `'|||'`) work now
- `current_stats=True` will enable statistics to be re-collected on the new `current` values after each EM iteration
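
As a rough illustration of the `'|||'` convention, a raw cell could be split into its init values like this. `parse_init_values` is a hypothetical helper, and treating a missing cell as an empty list is an assumption made for the sketch, not a description of how the loader in this PR handles `NULL`s.

```python
INIT_VALUE_SEPARATOR = "|||"


def parse_init_values(raw_cell):
    """Split a raw dataset cell into its list of init values.

    A cell without the separator yields a single-element list; a missing
    cell (None) yields an empty list (assumption for this sketch).
    """
    if raw_cell is None:
        return []
    return raw_cell.split(INIT_VALUE_SEPARATOR)


multi = parse_init_values("Chicago|||Chicgo")
single = parse_init_values("IL")
missing = parse_init_values(None)
```

Plain `str.split` is used here, so a single `|` inside a value is left alone and only the full three-character separator splits the cell.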
Ready for another review 👀