Incorrectly Calculated P-Values

Open • Vonyx1000 opened this issue 1 year ago • 1 comment

Hello Benjamin!

Thank you so much for this wonderful package and your incredible work. I have benefited a lot from your package and appreciate the time and effort you put into it. I've been running into an issue where some of the p-values do not match what I expect. I am working on a project and ran the following code:

pvalue <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(1:length(x), times=sapply(x, length)))
  if (is.numeric(y)) {
    # For numeric variables, perform a standard 2-sample t-test
    # p <- t.test(y ~ g)$p.value
  } else {
    # For categorical variables, perform individual chi-squared tests for each category
    p <- sapply(levels(y), function(z) chisq.test(table(y==z, g))$p.value)
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
}

# 2014/2015 data
t <- table1(~ `Command` | CombinedYear*TextScore,
            data = s201415,
            extra.col = list(`P-value` = pvalue),
            digits = 5, overall = FALSE,
            topclass = "Rtable1-zebra Rtable1-shade Rtable1-times")

The t-test is commented out because all of the data in this project is categorical. I got the following result:

I am looking at evaluation results where 97 groups each performed certain commands and were marked as either "Not Done" or "Well Done" for each command. Because 97 groups were evaluated, each row adds up to 97 unless data is missing (missing values were excluded for that row). When I calculate these p-values manually, the first one, "Command3", should be significant. See this example from an online chi-square calculator: https://www.socscistatistics.com/tests/chisquare2/default2.aspx

I have been troubleshooting for some time but I am unsure of what the issue is. I apologize if it is something that should be obvious, as I am not super experienced. I think it might be that instead of using 97 as the total, it is using the totals in the header (which are Not Done N=143 and Well Done N=831). Would you be able to provide some insight into why the p-values are being calculated this way? If so, how can I fix it?

Thank you so much in advance.

Warmest regards, Vaish

May 23 '24 01:05 Vonyx1000

First, you have to ask yourself what hypotheses you are testing (a p-value is always associated with a hypothesis test). You need to formulate this clearly in order for the p-value to have the desired meaning.

Note that in the screenshot you posted from the website, I don't think it is formulated correctly because you have a 2x2 contingency table with a grand total of 194(!). I don't think this represents your situation, but again you need to formulate the hypotheses clearly first.
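To see which table the `pvalue` function above actually tests, here is a minimal sketch of its categorical branch. It assumes, as the function's own comments suggest, that `x` is a list with one vector of values per stratum. The toy data and the name `tabs` are made up for illustration and are not the data from this issue.

```r
# Hypothetical toy data, NOT the data from this issue.
# x mimics a list with one vector of the variable's values per stratum.
x <- list(
  "Not Done"  = factor(c("Command1", "Command1", "Command2"),
                       levels = c("Command1", "Command2")),
  "Well Done" = factor(c("Command1", "Command2", "Command2", "Command2"),
                       levels = c("Command1", "Command2"))
)

# Same construction as in pvalue(): pooled values y and stratum labels g
y <- unlist(x)
g <- factor(rep(seq_along(x), times = sapply(x, length)))

# One 2x2 table per category: (y == category) versus stratum
tabs <- lapply(levels(y), function(z) table(y == z, g))
names(tabs) <- levels(y)

tabs[["Command1"]]       # the table that chisq.test() would receive
sum(tabs[["Command1"]])  # grand total: equals length(y)
```

In this sketch, the grand total of each 2x2 table is the number of observations pooled across all strata (here 3 + 4 = 7), not the total of any single row of the displayed table. Whether that matches the intended hypothesis depends on how the data are laid out.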

May 23 '24 12:05 benjaminrich