Items only recognized as numerical even when measurement is "nominal"
I noticed a bug with R 4.3.1 (my colleague does not have the bug with R 4.2.1).
The following code returns NA instead of TRUE:
temp <- as.item(c(1), labels = structure(c(0, 1), names = c("No", "Yes")))
temp == "Yes"
This is despite measurement(temp) being nominal, i.e. we would expect temp to be treated by default as a string rather than a number.
Note that both temp == 1 and temp %in% "Yes" correctly return TRUE.
Is there a way to change this behavior?
My version of memisc is 0.99.22 (or rather, a patched version of 0.99.22, as explained here).
The current version of memisc on CRAN is 0.99.31.7. Does this bug show up with that version as well?
No, the bug is with the older version. But I need this older version because it has useful features that are not in the new one.
Unfortunately, I can only support the newest version of the package. If there is any functionality missing in the newest version, please describe it, and I may consider (re-)implementing it or provide a workaround.
As explained here, memisc 0.99.22 makes it possible to run regressions without dropping data ('PNR' is considered a response category, which it is) while still treating these 'PNR' values as missing in the analysis (i.e. they have a special status for descriptive stats, graphs...).
It would be great, indeed, if this feature could be retained in the new versions.
Any fix for this? I can't find any other package that behaves the way I'd like. Namely, I'd like this behavior, cf. here:
test <- as.item(c(1, NA, -1), labels = structure(c(0, 1, -1), names = c("No", "Yes", "PNR")), missing.values = c(NA, -1))
as.character(test[1]) # "Yes"
as.numeric(test[1]) # 1
test %in% 1 # TRUE FALSE FALSE
test == 1 # TRUE NA FALSE
test %in% "Yes" # TRUE FALSE FALSE
test == "Yes" # TRUE NA FALSE
is.na(test) # FALSE TRUE FALSE
is.missing(test) # FALSE TRUE TRUE
lm(c(T, T, T) ~ test)$rank # 2 (i.e., keeps missing values that are not NA)
Please clarify: Is
test %in% 1 # TRUE FALSE FALSE
test == 1 # TRUE NA FALSE
the expected behaviour or the behaviour you observe? Thanks.
This is the expected behaviour.
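For reference, the difference between the two expected results mirrors how plain vectors behave in base R: `==` propagates NA, while `%in%` is built on match() and never returns NA. A minimal sketch, using an ordinary numeric vector (not an item) as a stand-in for the codes:

```r
# Base R only: a plain numeric vector standing in for the item's codes.
codes <- c(1, NA, -1)

eq_result <- codes == 1     # NA propagates through `==`
in_result <- codes %in% 1   # `%in%` relies on match(), so it never yields NA

print(eq_result)  # TRUE NA FALSE
print(in_result)  # TRUE FALSE FALSE
```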
Whatever the intended behaviour is, I believe it can be realized with the appropriate adjustments in the code.
This is what I get with current memisc:
library(memisc)
test <- as.item(c(1, NA, -1),
                labels = c(No = 0, Yes = 1, PNR = -1),
                missing.values = -1)
is.na(test)
# [1] FALSE TRUE FALSE
is.missing(test)
# [1] FALSE TRUE TRUE
test_im <- include.missings(test)
as.character(test)
# [1] "Yes" NA NA
as.character(test_im)
# [1] "Yes" NA "*PNR"
# To suppress the star, use 'include.missings(x,mark="")'
as.numeric(test)
# [1] 1 NA NA
as.numeric(test_im)
# [1] 1 NA -1
test %in% 1
# [1] TRUE FALSE FALSE
test_im %in% 1
# [1] TRUE FALSE FALSE
test == 1
# [1] TRUE NA FALSE
test_im == 1
# [1] TRUE NA FALSE
as.numeric(test) == 1
# [1] TRUE NA NA
test %in% "Yes"
# [1] TRUE FALSE FALSE
test == "Yes"
# [1] TRUE NA FALSE
test_im == "Yes"
# [1] TRUE NA FALSE
as.character(test) == "Yes"
# [1] TRUE NA NA
lm(c(T, T, T) ~ test)$rank
# [1] 2
lm(c(T, T, T) ~ test_im)$rank
# [1] 2
# This will not work:
ds <- data.set(test = test)
lm(c(T, T, T) ~ test, data = ds)$rank
Oh that's great, it seems that the new memisc actually works almost as I would like... except for the last line, which is disappointing. Why does it stop working as intended in regressions when it's inside a dataframe? Is there a way to make it work for dataframes in regressions?
Try this:
library(memisc)
ds <- data.set(
  test = as.item(c(1, NA, -1),
                 labels = c(No = 0, Yes = 1, PNR = -1),
                 missing.values = -1)
)
ds %$$% { # Shorthand for ds <- within(ds,...)
  test_im <- include.missings(test)
}
lm(c(T, T, T) ~ test_im, data = ds)$rank
Thank you for offering a solution. The issue is that this solution works for the regression but not for the previous tests (e.g. test == "Yes" returns NA NA NA and test %in% 1 returns FALSE FALSE FALSE). This is prone to mistakes, and to handle all cases, it requires having two variables instead of one.
> but not for the previous tests (e.g. test == "Yes" returns NA NA NA and test %in% 1 returns FALSE FALSE FALSE)
This is not correct.
ds$test behaves exactly as intended:
ds$test == 1
# [1] TRUE NA FALSE
ds$test == "Yes"
# [1] TRUE NA FALSE
It is the intended purpose of "item" objects with labels to make both comparisons possible. This is to make the preparation of survey data easier.
The idea of user-defined missing values is that they can be compared according to their codes and labels, but that they are automatically excluded from statistical analyses. Any other behaviour can be achieved by using either as.numeric(), as.character(), or include.missings().
Sorry I meant test_im (not test), the new variable that you created.
My complaint is that we don't have the same behavior after include.missings().
Comparisons work with test_im as with test. The only difference is (as intended) that -1 or "PNR" are not translated into NA, when the regression is run.
ds$test_im == 1
# [1] TRUE NA FALSE
ds$test_im == "Yes"
# [1] TRUE NA FALSE
Now that I have installed the most recent version of memisc, I confirm that the comparisons work with both test and test_im. My bad, the issue belonged to memisc 0.99.22.
I think I'll adopt the latest version of memisc and add include.missings() whenever needed (i.e., in a lot of places), as I don't have a better option (in particular, the other packages to handle surveys feature more annoying behaviors).
The issue is that include.missings(1) and include.missings("a") return an error (same for factors). This is quite annoying: I can't simply add include.missings() everywhere, because I have to distinguish cases depending on the variable class. It would be much easier if include.missings() worked with any class (and if as.character() and regressions included missing values by default).
You can include setMethod("include.missings","ANY",function(x,mark="*") x) at the beginning of your scripts to avoid these errors.
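As an illustration of the suggestion above, here is a minimal sketch of the fallback in use. It assumes memisc is installed and that include.missings() is an S4 generic, as the setMethod() call implies; the variable names are only for illustration.

```r
library(memisc)

# Fallback from the comment above: anything that is not an item
# is returned unchanged instead of raising an error.
setMethod("include.missings", "ANY", function(x, mark = "*") x)

num <- include.missings(1)
chr <- include.missings("a")
fct <- include.missings(factor(c("a", "b")))

# Items are still handled by memisc's own, more specific method.
test <- as.item(c(1, NA, -1),
                labels = c(No = 0, Yes = 1, PNR = -1),
                missing.values = -1)
test_im <- include.missings(test)
```

Because S4 dispatch prefers the most specific method, the "ANY" fallback should only be reached for classes that memisc does not already handle.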
I will include this in the current version of memisc. You can install it anytime using install.packages("memisc", repos = c("https://melff.r-universe.dev", "https://cloud.r-project.org")).
Also, if you need the missing values included, you can create a copy of your data.set object and do
new.data.set %$$% { # Shorthand for ... <- within(...,...)
  foreach(var=c(var1,var2,...),{
    missing.values(var) <- NULL
  })
}
Thank you for extending the domain of include.missings, this is very useful!
Is there a way to make include.missings = TRUE the default in as.character?
And to automatically convert a memisc item to character in a regression?
It's cumbersome to duplicate datasets (especially given that I handle multiple large datasets), so I'd prefer the above alternative.
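Pending a built-in option, one possible workaround for the as.character() part is a small user-side wrapper. This is only a sketch: as_character_im is a hypothetical helper, not part of memisc, and it relies on the "ANY" fallback suggested earlier in the thread and on the mark argument of include.missings().

```r
library(memisc)

# Fallback from earlier in the thread, so non-item inputs pass through unchanged.
setMethod("include.missings", "ANY", function(x, mark = "*") x)

# Hypothetical helper (not part of memisc): convert to character while
# keeping user-defined missing values, without the "*" marker by default.
as_character_im <- function(x, mark = "") {
  as.character(include.missings(x, mark = mark))
}

test <- as.item(c(1, NA, -1),
                labels = c(No = 0, Yes = 1, PNR = -1),
                missing.values = -1)

as_character_im(test)          # item: user-defined missings are kept
as_character_im(c("a", "b"))   # plain character vector: unchanged
```

This avoids duplicating the data set, but it does not change how lm() treats items; for regressions, the variable still has to be passed through include.missings() first.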