
initial_split with strata produces very uneven division

Open jttoivon opened this issue 1 year ago • 0 comments

Hi,

If I split a data frame using initial_split, stratifying by an imbalanced two-level factor, I get a very uneven division. Below is an example using synthetic data.

library(dplyr)
library(rsample)
set.seed(1845)
prevalence <- 0.0011
n <- 400000
cases <- floor(prevalence*n)
i <- sample(1:n, cases)
df <- tibble(id=1:n) %>%
  mutate(diagnose=factor(if_else(row_number() %in% i, "case", "control")))

summary(df)


mysplit <- initial_split(df, prop=0.5, strata = "diagnose")
train <- training(mysplit)
test <- testing(mysplit)

train %>% count(diagnose)
test %>% count(diagnose)

Since there are 440 cases, I would expect about 220 cases in train and 220 cases in test. However, I get counts of 206 and 234 instead. I would understand if rounding led to off-by-one differences, but this difference is too large for that. How is this possible?
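To put the size of the deviation in context, here is a rough sanity check in base R (no rsample calls). It asks how many cases would land in the training half if the split were a plain random split with no stratification at all, in which case the count follows a hypergeometric distribution. The variable names are illustrative only, and the check assumes exactly the numbers reported above (400000 rows, 440 cases, prop = 0.5, 206 cases observed in train). It is a sketch of the arithmetic, not a description of what initial_split does internally.

```r
# Back-of-the-envelope check using base R only.
N <- 400000        # total number of rows in the example
K <- 440           # number of "case" rows reported above
n_train <- N / 2   # rows in the training half for prop = 0.5
observed <- 206    # cases observed in the training half

# Under a plain random split (an assumption made for comparison, not
# rsample's documented behaviour), the number of cases in the training
# half is hypergeometric with the mean and standard deviation below.
expected_cases <- n_train * K / N
sd_cases <- sqrt(n_train * (K / N) * (1 - K / N) * (N - n_train) / (N - 1))

# Distance of the observed count from the mean, in standard deviations.
z <- (observed - expected_cases) / sd_cases

# Two-sided tail probability; the distribution is symmetric here because
# exactly half of the rows are drawn.
p_two_sided <- 2 * phyper(observed, K, N - K, n_train)

print(c(expected = expected_cases, sd = sd_cases, z = z, p = p_two_sided))
```

The expected count is 220 with a standard deviation of roughly 10.5, so 206 is only about 1.3 standard deviations below the mean. In other words, the observed 206/234 division is about what an unstratified random split would produce, whereas a split stratified on diagnose should put almost exactly 220 cases on each side. This suggests, but does not prove, that the rare level is not being treated as its own stratum.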

My rsample version is 1.2.1.

jttoivon • Apr 29 '24 15:04