cem icon indicating copy to clipboard operation
cem copied to clipboard

L1 of continuous variables seem to be incorrect?

Open mkiang opened this issue 10 years ago • 7 comments

When running the following sample code on simulated data:

n_trt = 1000
n_untrt = 1000
prob_trt_male = .75
prob_untrt_male = .15
n_trt_males = sum(rbinom(n_trt, 1, prob_trt_male))
n_trt_females = n_trt - n_trt_males
n_untrt_males = sum(rbinom(n_trt, 1, prob_untrt_male))
n_untrt_females = n_untrt - n_untrt_males


fake <- data.frame(trt = c(rep(0, n_untrt), rep(1, n_trt)), 
                   sex = c(rep(0, n_untrt_females), rep(1, n_untrt_males), 
                           rep(0, n_trt_females), rep(1, n_trt_males)))
fake$old <- NA
for (i in seq_along(fake$trt)) {
    if (fake$trt[i] + fake$sex[i] == 2){
        fake$old[i] <- rbinom(1, 1, .1)
        fake$bp[i] <- rnorm(1, 100, 20)
    } else if (fake$trt[i] + fake$sex[i] == 0) {
        fake$old[i] <- rbinom(1, 1, .9)
        fake$bp[i] <- rnorm(1, 280, 50)
    } else {
        fake$old[i] <- rbinom(1, 1, .5)
        fake$bp[i] <- rnorm(1, 175, 8)
    }
}

imbalance(fake$trt, fake, drop = "trt")

Returns this output:

Multivariate Imbalance Measure: L1=0.954
Percentage of local common support: LCS=13.0%

Univariate Imbalance Measures:

    statistic   type    L1      min     25%      50%      75%      max
sex   -0.5850 (diff) 0.585  0.00000   0.000  -1.0000  -1.0000   0.0000
old    0.6410 (diff) 0.641  0.00000   1.000   1.0000   1.0000   0.0000
bp   143.4229 (diff) 0.000 85.02359 121.272 158.6741 153.2892 233.4842

The L1 of bp is 0.000, which seems impossible. I can replicate it with almost any continuous variable in R, but when running this in Stata, the L1 (both univariate and multivariate) seems appropriately calculated.

mkiang avatar Nov 05 '15 20:11 mkiang

hi Mathew, I'll look into it. Can you please join the output of sessionInfo() ? thanks Stefano

Sent from my StrawBerry®

On 05 nov 2015, at 21:48, Mathew Kiang [email protected] wrote:

When running the following sample code on simulated data:

n_trt = 1000 n_untrt = 1000 prob_trt_male = .75 prob_untrt_male = .15 n_trt_males = sum(rbinom(n_trt, 1, prob_trt_male)) n_trt_females = n_trt - n_trt_males n_untrt_males = sum(rbinom(n_trt, 1, prob_untrt_male)) n_untrt_females = n_untrt - n_untrt_males

fake <- data.frame(trt = c(rep(0, n_untrt), rep(1, n_trt)), sex = c(rep(0, n_untrt_females), rep(1, n_untrt_males), rep(0, n_trt_females), rep(1, n_trt_males))) fake$old <- NA for (i in seq_along(fake$trt)) { if (fake$trt[i] + fake$sex[i] == 2){ fake$old[i] <- rbinom(1, 1, .1) fake$bp[i] <- rnorm(1, 100, 20) } else if (fake$trt[i] + fake$sex[i] == 0) { fake$old[i] <- rbinom(1, 1, .9) fake$bp[i] <- rnorm(1, 280, 50) } else { fake$old[i] <- rbinom(1, 1, .5) fake$bp[i] <- rnorm(1, 175, 8) } }

imbalance(fake$trt, fake, drop = "trt") Returns this output:

Multivariate Imbalance Measure: L1=0.954 Percentage of local common support: LCS=13.0%

Univariate Imbalance Measures:

statistic   type    L1      min     25%      50%      75%      max

sex -0.5850 (diff) 0.585 0.00000 0.000 -1.0000 -1.0000 0.0000 old 0.6410 (diff) 0.641 0.00000 1.000 1.0000 1.0000 0.0000 bp 143.4229 (diff) 0.000 85.02359 121.272 158.6741 153.2892 233.4842 The L1 of bp is 0.000, which seems impossible. I can replicate it with almost any continuous variable in R, but when running this in Stata, the L1 (both univariate and multivariate) seems appropriately calculated.

— Reply to this email directly or view it on GitHub.

siacus avatar Nov 05 '15 20:11 siacus

Sure. Thanks for prompt response:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cem_1.1.17      lattice_0.20-33

loaded via a namespace (and not attached):
[1] MASS_7.3-44         combinat_0.0-8      tools_3.2.2         nlme_3.1-122        grid_3.2.2         
[6] randomForest_4.6-12 MatchIt_2.4-21     

mkiang avatar Nov 05 '15 21:11 mkiang

Perhaps this should be a completely separate issue, but it also appears in Stata 13 that imbalance performs complete case analysis and thus results in a different L1 than R if there is any missing data.

mkiang avatar Nov 05 '15 23:11 mkiang

Hi @siacus, just wondering if there was any movement in this area? I'm planning on taking a closer look over the next few weeks and possibly coding my own (slower, less generalized, less efficient) method — don't want to duplicate effort if you've spotted the problem.

mkiang avatar Feb 26 '16 21:02 mkiang

Hi Mathew, I’m a bit late on this. I suggest you take a look at the come in CEM package and try to fix it. Then you send me the code and I’ll try to optimize and integrate in the new release of CEM with acknowledgment. What do you think? Stefano

Il giorno 26 feb 2016, alle ore 22:16, Mathew Kiang [email protected] ha scritto:

Hi @siacus https://github.com/siacus, just wondering if there was any movement in this area? I'm planning on taking a closer look over the next few weeks and possibly coding my own (slower, less generalized, less efficient) method — don't want to duplicate effort if you've spotted the problem.

— Reply to this email directly or view it on GitHub https://github.com/IQSS/cem/issues/1#issuecomment-189485967.

siacus avatar Feb 27 '16 02:02 siacus

Sure. I can give it a go but it'll take a while -- not a lot of bandwidth for the next couple weeks. 

Sent from my iPhone

On Fri, Feb 26, 2016 at 6:53 PM -0800, "Stefano Iacus" [email protected] wrote:

Hi Mathew,

I’m a bit late on this.

I suggest you take a look at the come in CEM package and try to fix it. Then you send me the code and I’ll try to optimize and integrate in the new release of CEM with acknowledgment.

What do you think?

Stefano

Il giorno 26 feb 2016, alle ore 22:16, Mathew Kiang [email protected] ha scritto:

Hi @siacus https://github.com/siacus, just wondering if there was any movement in this area? I'm planning on taking a closer look over the next few weeks and possibly coding my own (slower, less generalized, less efficient) method — don't want to duplicate effort if you've spotted the problem.

Reply to this email directly or view it on GitHub https://github.com/IQSS/cem/issues/1#issuecomment-189485967.

— Reply to this email directly or view it on GitHub.

mkiang avatar Feb 27 '16 03:02 mkiang

After a few hours, I can't seem to find the issue here. My sense is it around the xtabs() command and continuous variables, but I don't have the bandwidth to look closer. Just going to leave it open, but noting I won't be able to investigate further.

mkiang avatar Mar 21 '16 22:03 mkiang