
Items only recognized as numerical even when measurement is "nominal"

Open bixiou opened this issue 1 year ago • 17 comments

I noticed a bug with R 4.3.1 (my colleague does not have the bug with R 4.2.1). The following code returns NA instead of TRUE:

temp <- as.item(c(1), labels = structure(c(0, 1), names = c("No", "Yes")))
temp == "Yes"

This happens even though measurement(temp) is "nominal", so we would expect temp to be treated by default as a string rather than as a number.

Note that both temp == 1 and temp %in% "Yes" correctly return TRUE.

Is there a way to change this behavior?

My version of memisc is 0.99.22 (or rather, a patched version of 0.99.22, as explained here).

bixiou avatar May 13 '24 14:05 bixiou

The current version of memisc on CRAN is 0.99.31.7. Does this bug show up with that version as well?

melff avatar May 13 '24 15:05 melff

No, the bug is with the older version. But I need this older version, it has useful features which are not in the new one.

bixiou avatar May 14 '24 02:05 bixiou

Unfortunately, I can only support the newest version of the package. If there is any functionality missing in the newest version, please describe it, and I may consider (re-)implementing it or provide a workaround.

melff avatar May 15 '24 07:05 melff

As explained here, memisc 0.99.22 makes it possible to run regressions without dropping data ('PNR' is treated as a response category, which it is) while still treating 'PNR' as missing elsewhere in the analysis (i.e. it has a special status in descriptive stats, graphs...).

It would be great, indeed, if this feature could be retained in the new versions.

bixiou avatar May 15 '24 07:05 bixiou

Any fix for this? I can't find any other package that behaves the way I'd like. Namely, I'd like the following behavior, cf. here:

test <- as.item(c(1, NA, -1), labels = structure(c(0, 1, -1), names = c("No", "Yes", "PNR")), missing.values = c(NA, -1))
  
as.character(test[1]) # "Yes"
as.numeric(test[1]) # 1
test %in% 1 # TRUE FALSE FALSE
test == 1 # TRUE NA FALSE
test %in% "Yes" # TRUE FALSE FALSE
test == "Yes" # TRUE NA FALSE
is.na(test) # FALSE TRUE FALSE
is.missing(test) # FALSE TRUE TRUE 
lm(c(T, T, T) ~ test)$rank # 2 (i.e., keeps missing values that are not NA)

bixiou avatar Feb 28 '25 16:02 bixiou

Please clarify: Is

test %in% 1 # TRUE FALSE FALSE
test == 1 # TRUE NA FALSE

the expected behaviour or the behaviour you observe? Thanks.

melff avatar Feb 28 '25 17:02 melff

This is the expected behaviour.
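
For reference, the NA-versus-FALSE difference between those two lines matches plain base R and is not specific to memisc: `==` propagates NA, while `%in%` is built on match() and never returns NA. A minimal sketch on an ordinary numeric vector (no items involved, the vector `x` is only an illustration):

```r
# Plain numeric vector with the same codes as the item above
x <- c(1, NA, -1)

eq <- x == 1      # element-wise comparison: NA in, NA out
isin <- x %in% 1  # membership test: an unmatched NA yields FALSE

print(eq)
print(isin)
```

So an NA in the second position of `test == 1` and a FALSE in the second position of `test %in% 1` are consistent with how the two operators treat NA in base R.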

bixiou avatar Feb 28 '25 17:02 bixiou

Whatever the intended behaviour is, I believe it can be realized with the appropriate adjustments in the code.

This is what I get with current memisc:

library(memisc)
test <- as.item(c(1, NA, -1), 
                  labels = c(No=0, Yes = 1, PNR = -1), 
                  missing.values = -1)

is.na(test)
# [1] FALSE  TRUE FALSE
is.missing(test)
# [1] FALSE  TRUE  TRUE

test_im <- include.missings(test)

as.character(test)
# [1] "Yes" NA    NA
as.character(test_im)
# [1] "Yes"  NA     "*PNR"
# To suppress the star, use 'include.missings(x,mark="")'

as.numeric(test) 
# [1]  1 NA NA
as.numeric(test_im) 
# [1]  1 NA -1

test %in% 1 
# [1]  TRUE FALSE FALSE
test_im %in% 1 
# [1]  TRUE FALSE FALSE

test == 1
# [1]  TRUE    NA FALSE
test_im == 1
# [1]  TRUE    NA FALSE
as.numeric(test) == 1
# [1] TRUE   NA   NA

test %in% "Yes"
# [1]  TRUE FALSE FALSE

test == "Yes"
# [1]  TRUE    NA FALSE
test_im == "Yes"
# [1]  TRUE    NA FALSE
as.character(test) == "Yes"
# [1] TRUE   NA   NA

lm(c(T, T, T) ~ test)$rank
# [1] 2

lm(c(T, T, T) ~ test_im)$rank
# [1] 2

# This will not work:
ds <- data.set(test = test)
lm(c(T, T, T) ~ test, data = ds)$rank

melff avatar Mar 01 '25 21:03 melff

Oh that's great, it seems that the new memisc actually works almost as I would like... except for the last line, which is disappointing. Why does it stop working as intended in regressions when it's inside a dataframe? Is there a way to make it work for dataframes in regressions?

bixiou avatar Mar 01 '25 23:03 bixiou

Try this:

library(memisc)
ds <- data.set(
  test = as.item(c(1, NA, -1), 
                  labels = c(No=0, Yes = 1, PNR = -1), 
                  missing.values = -1)
)

ds %$$% { # Shorthand for ds <- within(ds,...)
  test_im <- include.missings(test)
}
lm(c(T, T, T) ~ test_im, data = ds)$rank

melff avatar Mar 02 '25 15:03 melff

Thank you for offering a solution. The issue is that this solution works for the regression but not for the previous tests (e.g. test == "Yes" returns NA NA NA and test %in% 1 returns FALSE FALSE FALSE). This is prone to mistakes, and to handle all cases, it requires having two variables instead of one.

bixiou avatar Mar 02 '25 15:03 bixiou

> but not for the previous tests (e.g. test == "Yes" returns NA NA NA and test %in% 1 returns FALSE FALSE FALSE)

This is not correct. ds$test behaves exactly as intended:

ds$test == 1
# [1]  TRUE    NA FALSE
ds$test == "Yes"
# [1]  TRUE    NA FALSE

It is the intended purpose of "item" objects with labels to make both comparisons possible. This is to make the preparation of survey data easier.

The idea of user-defined missing values is that they can be compared according to their codes and labels, but that they are automatically excluded from statistical analyses. Any other behaviour can be achieved by using either as.numeric(), as.character(), or include.missings().

melff avatar Mar 02 '25 18:03 melff

Sorry I meant test_im (not test), the new variable that you created. My complaint is that we don't have the same behavior after include.missings().

bixiou avatar Mar 02 '25 18:03 bixiou

Comparisons work with test_im as with test. The only difference is (as intended) that -1 or "PNR" are not translated into NA, when the regression is run.

ds$test_im == 1
# [1]  TRUE    NA FALSE
ds$test_im == "Yes"
# [1]  TRUE    NA FALSE

melff avatar Mar 02 '25 18:03 melff

Now that I have installed the most recent version of memisc, I confirm that the comparisons work with both test and test_im. My bad, the issue belonged to memisc 0.99.22.

I think I'll adopt the latest version of memisc and add include.missings() whenever needed (i.e., in a lot of places), as I don't have a better option (in particular, the other packages to handle surveys feature more annoying behaviors).

The issue is that include.missings(1) and include.missings("a") return an error (same for factors), so it's quite annoying because I can't even add "include.missings" everywhere, I have to distinguish depending on the variable class. It would be much easier if include.missings worked with any class (and if as.character() and regressions included missing values by default).

bixiou avatar Mar 03 '25 23:03 bixiou

You can include setMethod("include.missings","ANY",function(x,mark="*") x) at the beginning of your scripts to avoid these errors.

I will include this in the current version of memisc. You can install it anytime using install.packages("memisc", repos = c("https://melff.r-universe.dev", "https://cloud.r-project.org")).
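
As a rough sketch of how that fallback could be used in a script: this assumes include.missings is an S4 generic with arguments x and mark, as the setMethod call above implies, and reuses the example item from earlier in this thread. No output is shown for the plain inputs beyond what the method definition itself guarantees, namely that they are returned unchanged.

```r
library(memisc)

# Fallback from the comment above: any class without a more specific
# include.missings() method is returned unchanged.
setMethod("include.missings", "ANY", function(x, mark = "*") x)

num <- include.missings(1)                         # plain numeric passes through
chr <- include.missings("a")                       # plain character passes through
fac <- include.missings(factor(c("No", "Yes")))    # factor passes through

# Items should still dispatch to memisc's own, more specific method.
test <- as.item(c(1, NA, -1),
                labels = c(No = 0, Yes = 1, PNR = -1),
                missing.values = -1)
test_im <- include.missings(test)
as.character(test_im)
```

With the fallback in place, include.missings() can be applied uniformly to every variable without first checking its class.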

Also, if you need the missing values included, you can create a copy of your data.set object and do

new.data.set %$$% {  # Shorthand for ... <- within(...,...)
    foreach(var=c(var1,var2,...),{
        missing.values(var) <- NULL
    })
}

melff avatar Mar 05 '25 11:03 melff

Thank you for extending the domain of include.missings, this is very useful! Is there a way to make include.missings = TRUE the default in as.character? And to automatically convert a memisc item to character in a regression? It's cumbersome to duplicate datasets (especially given that I handle multiple large datasets), so I'd prefer the above alternative.

bixiou avatar Mar 05 '25 14:03 bixiou