Request for display of error messages and ability to check observed and expected values about performance_hosmer()
For example, when trying to run the Hosmer-Lemeshow goodness-of-fit test for the following multiple logistic regression analysis, the following error message is displayed.
data("Titanic")
df <- data.frame(Titanic)
df <- data.frame(Class = rep(df$Class, df$Freq),
Sex = rep(df$Sex, df$Freq),
Age = rep(df$Age, df$Freq),
Survived = rep(df$Survived, df$Freq))
model <- glm(Survived ~ . ,data = df, family = binomial())
library(performance)
performance_hosmer(model)
## Error in cut.default(yhat, breaks = stats::quantile(yhat, probs = seq(0, : 'breaks' are not unique
As the error message indicates, this is because the range of quantiles cannot be uniquely determined when taking deciles.
However, I think this error message alone is somewhat informational and it is difficult to understand why the calculation was not completed correctly.
For example, I think it would be more informative if the following error message were displayed to indicate that the calculation could not be performed with the specified n_bins when the quantile range could not be determined.
cutyhat <- try(
cut(
yhat,
breaks = stats::quantile(yhat, probs = seq(0, 1, 1 / n_bins)),
include.lowest = TRUE
),
silent = TRUE
)
if("try-error" %in% is(cutyhat))
insight::format_error("Subgroups cannot be defined with the specified `n_bins`.")
I think this would provide a little information in case of errors.
performance_hosmer(model)
## Error: Subgroups cannot be defined with the specified `n_bins`.
Also, I would like to refer to the observed and expected value
information for each subgroup after the calculation to check for biased
subgroups. For example, so that I can call it up in chisq.test() using
$.
x <- matrix(c(12, 5, 7, 7), ncol = 2)
chisq.test(x)$observed
## [,1] [,2]
## [1,] 12 7
## [2,] 5 7
chisq.test(x)$expected
## [,1] [,2]
## [1,] 10.419355 8.580645
## [2,] 6.580645 5.419355
For example, the following would achieve this.
hoslem <- list(
chisq = chisq,
df = n_bins - 2,
p.value = p.value,
observed = obs,
expected = expect
)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
performance_hosmer(model)$observed
##
## cutyhat V1 y
## [0.0174,0.051] 4 0
## (0.051,0.0828] 3 0
## (0.0828,0.158] 3 0
## (0.158,0.243] 2 1
## (0.243,0.369] 2 1
## (0.369,0.52] 2 1
## (0.52,0.615] 1 2
## (0.615,0.868] 0 3
## (0.868,0.982] 1 2
## (0.982,0.998] 0 4
performance_hosmer(model)$expected
##
## cutyhat V1 yhat
## [0.0174,0.051] 3.87984951 0.12015049
## (0.051,0.0828] 2.77779066 0.22220934
## (0.0828,0.158] 2.67732615 0.32267385
## (0.158,0.243] 2.38429554 0.61570446
## (0.243,0.369] 2.06107259 0.93892741
## (0.369,0.52] 1.67780793 1.32219207
## (0.52,0.615] 1.35146089 1.64853911
## (0.615,0.868] 0.90896607 2.09103393
## (0.868,0.982] 0.24958571 2.75041429
## (0.982,0.998] 0.03184495 3.96815505
This observed and expected value information does not have to be
displayed when print.performance_hosmer() is done, but it would be
nice if it could be called out with $.
Also, I think the value in the document should be “An object of class performance_hosmer with the following values”, not “An object of class hoslem_test with the following values”.
Of course, I know there are various arguments for the Hosmer-Lemeshow test, I would appreciate it if you would consider whether or not to implement each of these, and I don’t care how they are implemented.
If these are as suggested, I’ll do a pull request, if not, that’s fine.