data.table operations within validate

Open katrinabrock opened this issue 6 years ago • 0 comments

Hello, I am trying to use data.table [i, j, by] operations inside my validator. According to this comment, https://github.com/data-cleaning/validate/issues/55#issuecomment-205220681, this was not supported as of 2016. Is that still the case?

Here's an example of the kind of operation I'm trying to do:

  1. melt the data table
  2. then subset it with an [i, j, by] statement
  3. then run the check on the result

## MELT POC
library(data.table)

# example from https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1         30 1998-11-26 2000-01-29         NA
2         27 1996-06-22         NA         NA
3         26 2002-07-11 2004-04-05 2007-09-02
4         32 2004-10-10 2009-08-27 2012-07-21
5         29 2000-12-05 2005-02-28         NA"
DT <- fread(s1)
DT.m1 <- melt(DT, id.vars = c("family_id", "age_mother"),
              measure.vars = c("dob_child1", "dob_child2", "dob_child3"))

## Run the check without the validate package
melt(
    DT,
    id.vars = c("family_id", "age_mother"),
    measure.vars = c("dob_child1", "dob_child2", "dob_child3")
)[
    variable == "dob_child1",
][['age_mother']] > 30


# validator
library(validate)

# Rule with melt() only inside the validator: this one works
working_validator <- validator(
    melt(.,
        id.vars = c("family_id", "age_mother"),
        measure.vars = c("dob_child1", "dob_child2", "dob_child3")
    )[['age_mother']] > 30
)

working_res <- confront(DT, working_validator)
summary(working_res)


# Same rule with a data.table [i, ] subset added after melt(): this one does not work
non_working_validator <- validator(
    melt(.,
        id.vars = c("family_id", "age_mother"),
        measure.vars = c("dob_child1", "dob_child2", "dob_child3")
    )[
        variable == "dob_child1",
    ][['age_mother']] > 30
)
non_working_res <- confront(DT, non_working_validator)
non_working_res$._error

I'm aware that for this example I could run the age_mother > 30 & variable == "dob_child1" check on the original data or use subset, but I'd like to enable more complex data.table workflows in general.
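
One possible interim workaround, offered as an assumption rather than as behaviour confirmed by the validate maintainers, is to do the data.table reshaping and subsetting before calling confront(), so that the rule itself only contains a plain column comparison. A minimal sketch, where first_child and rules are hypothetical names introduced for illustration:

```r
library(data.table)
library(validate)

# Same toy data as in the example above
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1         30 1998-11-26 2000-01-29         NA
2         27 1996-06-22         NA         NA
3         26 2002-07-11 2004-04-05 2007-09-02
4         32 2004-10-10 2009-08-27 2012-07-21
5         29 2000-12-05 2005-02-28         NA"
DT <- fread(s1)

# Steps 1 and 2: melt and subset with data.table, outside of validate
# (first_child is a hypothetical name, not part of either package)
first_child <- melt(
    DT,
    id.vars = c("family_id", "age_mother"),
    measure.vars = c("dob_child1", "dob_child2", "dob_child3")
)[variable == "dob_child1"]

# Step 3: the rule is now an ordinary comparison on one column
rules <- validator(age_mother > 30)
res <- confront(first_child, rules)
summary(res)
```

This keeps the [i, j, by] logic in ordinary data.table code, at the cost of needing one pre-processed table per rule, which is why it does not replace support inside the validator itself.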

katrinabrock, Jan 07 '20 01:01