Argument order in aov_ez

Open dgromer opened this issue 10 years ago • 8 comments

Is there a specific reason why id is the first argument in aov_ez, and data is the third?

It would make more sense to me if data was the first argument, because then it would fit nicely into pipelines like

data %>%
  dplyr::filter(some_filtering_here) %>%
  aov_ez("id", "dv", further_arguments_here)

instead of now

data %>%
  dplyr::filter(some_filtering_here) %>%
  aov_ez("id", "dv", ., further_arguments_here)

dgromer • Sep 20 '15 16:09

I think the reason for having both id and dv before data was to clearly separate those required arguments from the two more or less optional arguments between and within. But I see that even without dplyr it could make sense to have data as the first argument, e.g., when using lapply. On the other hand, this design would make it easy to run ANOVAs on many subsets of the data, which should in principle only exacerbate the existing problem of Type I error accumulation (which exists for any multifactor ANOVA).

I have also just recently made some rather strong changes to the interface, so I am not sure now is the time for the next ones, but it is an idea I will keep in mind. Perhaps only adding an alternative version with a changed ordering of arguments could work for the time being, e.g., aov_ez2.
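A minimal sketch of what such an alternative ordering could look like. The helper `data_first` and the stand-in `toy_aov_ez` are hypothetical names used only for illustration; neither is part of afex, and the stand-in merely mimics the leading argument order (id, dv, data) so the example runs without the package.

```r
# Hypothetical helper (not part of afex): wrap a function that takes a
# `data` argument so that `data` becomes the first argument of the wrapper.
data_first <- function(f) {
  function(data, ...) f(..., data = data)
}

# Stand-in with the same leading argument order as aov_ez (id, dv, data).
# It only reports what it received instead of fitting an ANOVA.
toy_aov_ez <- function(id, dv, data, between = NULL) {
  list(id = id, dv = dv, n = nrow(data), between = between)
}

# The "aov_ez2"-style variant: data first, then id and dv.
toy_aov_ez2 <- data_first(toy_aov_ez)

df <- data.frame(
  id    = 1:4,
  value = c(2.1, 3.4, 1.9, 4.0),
  group = c("a", "a", "b", "b")
)

# Positional arguments after `data` are matched to id and dv in that order.
res <- toy_aov_ez2(df, "id", "value", between = "group")
```

In principle the same wrapper could be applied to the real function, but that is an assumption here and would leave the original argument order untouched for existing code.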

singmann • Sep 20 '15 19:09

Having data as the first argument seems more intuitive to me, since then all following arguments clearly refer to this data frame. Right now, the first two arguments are somewhat out of context in my opinion. And aov_ez would be more similar to ez::ezANOVA ;)

However, I wouldn't include the aov_ez2 wrapper, because it could make things more complicated for people starting with the package.

dgromer • Sep 21 '15 07:09

I am somewhat inclined to make this change. The only problem is, this will really break a lot of existing code using this function, so it would be quite a big change.

I have some plans for making some rather drastic changes (e.g., harmonizing all function and argument names to use _ instead of .) for version 1.0. And if I decide to do so, I will include this change as well (I will keep it open to remind me).

singmann • Oct 08 '15 13:10

@singmann This is somewhat related: when there are only between-subjects factors (and only one observation per subject), there really is no reason to have an input for id. I know it is not good practice to allow the first argument to be missing while requiring others, but it is possible to have:

aov_ez <- function (id, dv, data, 
                    between = NULL, within = NULL, covariate = NULL, 
                    observed = NULL,
                    fun_aggregate = NULL, transformation, 
                    type = afex_options("type"), 
                    factorize = afex_options("factorize"),
                    check_contrasts = afex_options("check_contrasts"), 
                    return = afex_options("return_aov"), 
                    anova_table = list(), 
                    include_aov = afex_options("include_aov"), 
                    ..., 
                    print.formula = FALSE) {
  
  if (missing(id)) {
    # warning?
    data$.id_var <- seq_len(nrow(data))
    id <- ".id_var"
  }
  
  ...
}

mattansb • Oct 12 '21 07:10

First of all, great to get back to this issue after a long time. Funny to read my thoughts from 5 years back, even though I decided against implementing that.

Anyway, to get to your point @mattansb, I know that sometimes one just has between-subjects data and doesn't really need the participant identifier. However, I feel that, from a conceptual and pedagogical perspective, always requiring the user to specify the participant identifier is good. A data set should always have such a column to ensure that nothing goes wrong during data manipulation/preparation. From my teaching experience, it is not too difficult to explain that data always needs this column. However, data without a participant identifier can lead to problems.

What this means is that I am unlikely to accept changes that will enable this behaviour. In any case, if I were to be convinced, it would have to be added to aov_car() as well, as this is the main function.

singmann • Oct 12 '21 13:10

Fair enough (:

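In any case, regarding @dgromer's original issue, using the native pipe (R >= 4.1.0), one can still pipe even without the . placeholder, with some creativity: if id and dv are passed by name, the piped data frame is matched to the first remaining argument, which is data: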

data(obk.long, package = "afex")

obk.long |> 
  dplyr::filter(gender == "F") |> 
  afex::aov_ez(id = "id", dv = "value", 
               between = "treatment", 
               within = c("phase", "hour"))
#> Anova Table (Type 3 tests)
#> 
#> Response: value
#>                 Effect     df   MSE        F  ges p.value
#> 1            treatment   2, 5 11.83     2.56 .240    .171
#> 2                phase  2, 10  5.08   4.64 * .197    .038
#> 3      treatment:phase  4, 10  5.08     2.40 .202    .119
#> 4                 hour  4, 20  2.16 7.66 *** .256   <.001
#> 5       treatment:hour  8, 20  2.16     0.28 .025    .963
#> 6           phase:hour  8, 40  0.97     1.21 .047    .316
#> 7 treatment:phase:hour 16, 40  0.97     0.60 .046    .864
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
#> 
#> Sphericity correction method: GG
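
Why this works can be sketched without afex. The native pipe inserts its left-hand side as the first argument of the call, and R matches named arguments before positional ones, so the piped data frame lands in the first formal that is still unmatched. The function `toy_anova` below is a hypothetical stand-in that only copies the leading argument order (id, dv, data); it does not fit a model.

```r
# Hypothetical stand-in (not part of afex) with formals ordered id, dv, data.
toy_anova <- function(id, dv, data) {
  list(id = id, dv = dv, n = nrow(data))
}

df <- data.frame(id = 1:4, value = c(2.1, 3.4, 1.9, 4.0))

# `df |> toy_anova(id = "id", dv = "value")` is parsed as
# toy_anova(df, id = "id", dv = "value"). The named arguments are matched
# first, so the unnamed `df` fills the remaining formal, `data`.
piped  <- df |> toy_anova(id = "id", dv = "value")
direct <- toy_anova("id", "value", df)
```

Both calls receive the same arguments, so `piped` and `direct` are identical.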

mattansb • Oct 12 '21 14:10

The native pipe call works great. Thanks for showing that.

Can you elaborate a bit on which real data analysis situations make omitting the id variable a tremendous benefit? I know that some example data does not have it, but I feel like real data basically always has it, so it does not seem like an actual problem to me.

singmann • Oct 12 '21 15:10

Nah, you were right - best practice would be to have an ID column.

mattansb • Oct 13 '21 09:10