rsample
initial_split with strata produces very uneven division
Hi,
If I split a data frame with initial_split, stratifying by an imbalanced two-level factor, the minority level ends up very unevenly divided between the training and test sets. Below is an example using synthetic data.
library(dplyr)
library(rsample)
set.seed(1845)
prevalence <- 0.0011
n <- 400000
cases <- floor(prevalence*n)
i <- sample(1:n, cases)
df <- tibble(id=1:n) %>%
  mutate(diagnose=factor(if_else(row_number() %in% i, "case", "control")))
summary(df)
mysplit <- initial_split(df, prop=0.5, strata = "diagnose")
train <- training(mysplit)
test <- testing(mysplit)
train %>% count(diagnose)
test %>% count(diagnose)
Since there are 440 cases and prop is 0.5, I would expect about 220 cases in train and 220 cases in test. However, I get 206 and 234 instead. I would understand a rounding error causing an off-by-one difference, but a gap of 14 from the expected count is too large for that. How is this possible?
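For comparison, here is a rough sketch of how much the case count would vary if the split were purely random, with no stratification at all. It uses only base R and the numbers from the example above; the variable names are just for illustration. Under that assumption the number of cases in train follows a hypergeometric distribution with a standard deviation of roughly 10, so 206 versus 234 looks more like an unstratified split than a stratified one.

```r
# Sketch: spread of the case count under a plain random 50/50 split
# (assumes NO stratification; numbers taken from the example above)
n_total <- 400000        # rows in df
n_cases <- 440           # rows with diagnose == "case"
n_train <- n_total / 2   # rows drawn into train when prop = 0.5

# The number of cases landing in train is hypergeometric
p_case <- n_cases / n_total
expected_cases <- n_train * p_case
sd_cases <- sqrt(n_train * p_case * (1 - p_case) *
                   (n_total - n_train) / (n_total - 1))

# Probability of a deviation at least as large as the observed one:
# 206 or fewer cases in train, or 234 or more
p_dev <- phyper(206, n_cases, n_total - n_cases, n_train) +
  phyper(233, n_cases, n_total - n_cases, n_train, lower.tail = FALSE)

print(c(expected = expected_cases, sd = sd_cases, p = p_dev))
```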
My rsample version is 1.2.1.
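One possible explanation that I have not confirmed: initial_split has a pool argument (default 0.1), documented as the proportion of the data below which a group is considered too small and is pooled into another group. The case prevalence here is 0.0011, far below that default, so the "case" level may simply not be treated as its own stratum. The sketch below rebuilds the same kind of data and retries the split with pool set below the prevalence. The value 0.001 is my own choice, and rsample may warn that stratifying such a small group is risky.

```r
library(dplyr)
library(rsample)

set.seed(1845)
n <- 400000
case_rows <- sample(n, 440)
df <- tibble(id = 1:n) %>%
  mutate(diagnose = factor(if_else(id %in% case_rows, "case", "control")))

# Assumption: with pool below the case prevalence (0.0011), "case" is
# kept as a separate stratum instead of being pooled with "control".
split_low_pool <- initial_split(df, prop = 0.5, strata = diagnose, pool = 0.001)

train_counts <- count(training(split_low_pool), diagnose)
test_counts <- count(testing(split_low_pool), diagnose)
print(train_counts)
print(test_counts)
```

If the case counts come out close to 220 each with this setting, the pooling threshold would explain the result I originally saw.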