performance icon indicating copy to clipboard operation
performance copied to clipboard

Request for display of error messages and ability to check observed and expected values about performance_hosmer()

Open indenkun opened this issue 2 years ago • 0 comments

For example, when trying to run the Hosmer-Lemeshow goodness-of-fit test for the following multiple logistic regression analysis, the following error message is displayed.

data("Titanic")
df <- data.frame(Titanic)
df <- data.frame(Class = rep(df$Class, df$Freq),
                 Sex = rep(df$Sex, df$Freq),
                 Age = rep(df$Age, df$Freq),
                 Survived = rep(df$Survived, df$Freq))
model <- glm(Survived ~ . ,data = df, family = binomial())
library(performance)
performance_hosmer(model)
## Error in cut.default(yhat, breaks = stats::quantile(yhat, probs = seq(0, : 'breaks' are not unique

As the error message indicates, this is because the range of quantiles cannot be uniquely determined when taking deciles.

However, I think this error message alone is somewhat informational and it is difficult to understand why the calculation was not completed correctly.

For example, I think it would be more informative if the following error message were displayed to indicate that the calculation could not be performed with the specified n_bins when the quantile range could not be determined.

cutyhat <- try(
  cut(
    yhat,
    breaks = stats::quantile(yhat, probs = seq(0, 1, 1 / n_bins)),
    include.lowest = TRUE
  ),
  silent = TRUE
)
if("try-error" %in% is(cutyhat))
  insight::format_error("Subgroups cannot be defined with the specified `n_bins`.")

I think this would provide a little information in case of errors.

performance_hosmer(model)
## Error: Subgroups cannot be defined with the specified `n_bins`.

Also, I would like to refer to the observed and expected value information for each subgroup after the calculation to check for biased subgroups. For example, so that I can call it up in chisq.test() using $.

x <- matrix(c(12, 5, 7, 7), ncol = 2)
chisq.test(x)$observed
##      [,1] [,2]
## [1,]   12    7
## [2,]    5    7
chisq.test(x)$expected
##           [,1]     [,2]
## [1,] 10.419355 8.580645
## [2,]  6.580645 5.419355

For example, the following would achieve this.

hoslem <- list(
    chisq = chisq,
    df = n_bins - 2,
    p.value = p.value,
    observed = obs,
    expected = expect
  )
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
performance_hosmer(model)$observed
##                 
## cutyhat          V1 y
##   [0.0174,0.051]  4 0
##   (0.051,0.0828]  3 0
##   (0.0828,0.158]  3 0
##   (0.158,0.243]   2 1
##   (0.243,0.369]   2 1
##   (0.369,0.52]    2 1
##   (0.52,0.615]    1 2
##   (0.615,0.868]   0 3
##   (0.868,0.982]   1 2
##   (0.982,0.998]   0 4
performance_hosmer(model)$expected
##                 
## cutyhat                  V1       yhat
##   [0.0174,0.051] 3.87984951 0.12015049
##   (0.051,0.0828] 2.77779066 0.22220934
##   (0.0828,0.158] 2.67732615 0.32267385
##   (0.158,0.243]  2.38429554 0.61570446
##   (0.243,0.369]  2.06107259 0.93892741
##   (0.369,0.52]   1.67780793 1.32219207
##   (0.52,0.615]   1.35146089 1.64853911
##   (0.615,0.868]  0.90896607 2.09103393
##   (0.868,0.982]  0.24958571 2.75041429
##   (0.982,0.998]  0.03184495 3.96815505

This observed and expected value information does not have to be displayed when print.performance_hosmer() is done, but it would be nice if it could be called out with $.

Also, I think the value in the document should be “An object of class performance_hosmer with the following values”, not “An object of class hoslem_test with the following values”.

Of course, I know there are various arguments for the Hosmer-Lemeshow test, I would appreciate it if you would consider whether or not to implement each of these, and I don’t care how they are implemented.

If these are as suggested, I’ll do a pull request, if not, that’s fine.

indenkun avatar May 04 '23 13:05 indenkun