Eugene Wu
Sometimes I just want it to print some warnings and return a database instead of throwing an error.
Is this what needs to be supported?

* Statistical constraint based on the distance between group-by results and a user-supplied set of values (complaints)
* Language: range and numerical predicates
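To make the first item concrete, here is a rough sketch of how such a constraint might be scored. It is not alphaclean's API: `complaint_distance`, its parameters, and the column names are all hypothetical, and it assumes a mean aggregate with an absolute-difference distance.

```python
import pandas as pd


def complaint_distance(df, group_col, value_col, complaints):
    """Sum of |group mean - complained value| over the complained-about groups.

    `complaints` maps a group key to the value the user expects for it.
    All names here are hypothetical, chosen for illustration only.
    """
    # Group-by result: one aggregated value per group.
    observed = df.groupby(group_col)[value_col].mean()
    expected = pd.Series(complaints, dtype=float)
    # Align on the group key; groups missing from the data become NaN.
    observed = observed.reindex(expected.index)
    # sum() skips NaN, so complaints about absent groups contribute nothing.
    return float((observed - expected).abs().sum())


df = pd.DataFrame({
    "city": ["SF", "SF", "NYC", "NYC"],
    "amount": [10.0, 30.0, 5.0, 15.0],
})
score = complaint_distance(df, "city", "amount", {"SF": 25.0, "NYC": 10.0})
```

A lower score means the group-by results sit closer to the complaints, so a search could use it as the quality function to minimize.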
In my mind the cost goes into applying the cleaning program (fast), computing the quality function (can be slow), and search. I imagine the following optimizations. Which are actually supported?...
This can be done using vector operations instead of a for loop: https://github.com/sjyk/alphaclean/blob/c0691df13aeec279ce1aae25ee6ac0cf700c10b4/alphaclean/constraint_languages/statistical.py#L33
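As a generic illustration of the suggestion (this is not the code at the linked line, and the function names are made up), the sketch below computes the same sum of absolute differences with an element-wise loop and with NumPy array operations.

```python
import numpy as np


def distance_loop(observed, expected):
    """Accumulate absolute differences one element at a time."""
    total = 0.0
    for o, e in zip(observed, expected):
        total += abs(o - e)
    return total


def distance_vectorized(observed, expected):
    """Same quantity, computed with NumPy array operations."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.abs(observed - expected).sum())


observed = [20.0, 10.0, 7.5]
expected = [25.0, 10.0, 5.0]
loop_result = distance_loop(observed, expected)
vector_result = distance_vectorized(observed, expected)
```

Both functions return the same value. The vectorized form moves the per-element work out of the Python interpreter, which tends to matter when the quality function is evaluated many times during search.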
In [example 3](https://github.com/sjyk/alphaclean/blob/master/docs/03-ExternalInfo.md), the resulting program has a lot of "delete CEO" type statements. If the user wanted to reconcile CEO occupations, what would they do? `df = delete(df,'contbr_occupation',('contbr_occupation', set(['OWNER...
When I make up fake data for example 1, the program does not change the data correctly: ``` data = pd.DataFrame([ dict(a="San Francisco", b="FOO") ]) dcprogram.run(data) # does not change...
https://github.com/sjyk/alphaclean/blob/c0691df13aeec279ce1aae25ee6ac0cf700c10b4/alphaclean/search.py#L128
Maybe the R script can generate a standalone `pygg_functions.py` file that `pygg.py` can `import *` from? That way we won't need to copy and paste in the future.
The current version of `ggplot()` takes a variable name as input, by default "data", and relies on the `prefix` argument of `ggsave()` to set the data object. ``` ggplot('data', aes(...)) + ggsave(..., prefix=data_py(dataobject))...