Question regarding prediction for levels of a covariate that aren't in the sample
I'm trying to understand the predict() function in dbarts. The code chunk below tries to predict for the observations in test_dat. However, many of the observations in test_dat contain levels of the covariate rank that are not observed in the training sample, so the last line of the chunk doesn't run.
library(car)     # attaches carData, which provides the Salaries data set
library(dbarts)

dat = Salaries
bartFit = bart(x.train = dat[, c("rank", "discipline", "sex")],
               y.train = dat$salary,
               keeptrees = TRUE)

# Test grid: the three observed ranks plus 10,000 ranks never seen in training
test_dat = expand.grid(rank = c(levels(dat$rank), paste0(1:10000, "_A")),
                       discipline = levels(dat$discipline),
                       sex = levels(dat$sex))
bart_prediction = predict(bartFit, newdata = test_dat,
                          type = "ppd") # doesn't run
However, the following code chunk runs:
dat_morelevels = Salaries
# Same data, but the factor now declares 10,000 extra (unobserved) levels
levels(dat_morelevels$rank) = c("AsstProf", "AssocProf", "Prof", paste0(1:10000, "_A"))
bartFit_morelevels = bart(x.train = dat_morelevels[, c("rank", "discipline", "sex")],
                          y.train = dat_morelevels$salary,
                          keeptrees = TRUE)
bart_prediction = predict(bartFit_morelevels, newdata = test_dat,
                          type = "ppd") # runs
I noticed that bartFit_morelevels and bartFit are exactly the same model; however, bartFit_morelevels is able to sample from the posterior predictive distribution for the extra levels paste0(1:10000,"_A") of the covariate rank.
What does levels(dat_morelevels$rank) = c("AsstProf", "AssocProf", "Prof", paste0(1:10000,"_A")) actually do to bartFit_morelevels if bartFit_morelevels and bartFit are exactly the same model?
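To make the question concrete, here is a small base-R check of what the levels assignment changes in the data itself. It is only a sketch on a toy factor (four made-up observations and three extra levels instead of 10,000), and it says nothing about what dbarts does internally with the extra levels.

```r
# Toy stand-in for dat$rank (made-up values, not the Salaries data).
rank_obs <- factor(c("AsstProf", "AssocProf", "Prof", "Prof"),
                   levels = c("AsstProf", "AssocProf", "Prof"))

# Same assignment as in the question, with 3 extra levels instead of 10,000.
rank_more <- rank_obs
levels(rank_more) <- c(levels(rank_obs), paste0(1:3, "_A"))

# The observed values are untouched by the assignment ...
values_unchanged <- identical(as.character(rank_obs), as.character(rank_more))

# ... but the factor now declares more levels than it did before.
n_levels <- c(obs = nlevels(rank_obs), more = nlevels(rank_more))

# The extra levels have zero observations in the training data.
counts <- table(rank_more)
print(counts)
```

So the training responses and covariate values are identical in both fits; the only difference visible in the data is the set of declared levels.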
Thanks for pointing this out. Technically, with one-hot encoding for factor variables, the trees can correctly sort observations with new levels. I'll look into enabling it. The only case in which this would not be correct is when the model is trained on a factor with only two levels, since those are encoded in a slightly different manner.
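As a rough illustration of the one-hot point, here is a base-R sketch. It is not the dbarts implementation: one_hot and two_level are hypothetical helpers, the split rule is made up, and the single 0/1 column for two-level factors is an assumption about what "encoded in a slightly different manner" means.

```r
# Hypothetical helper: one indicator column per declared factor level.
one_hot <- function(f) {
  m <- outer(as.character(f), levels(f), "==") * 1L
  colnames(m) <- levels(f)
  m
}

all_levels <- c("AsstProf", "AssocProf", "Prof", "1_A")
train <- factor(c("AsstProf", "Prof", "AssocProf"), levels = all_levels)
new   <- factor("1_A", levels = all_levels)

x_train <- one_hot(train)
x_new   <- one_hot(new)

# The column for the unseen level is constant (all zero) in training,
# so a split on it could not separate any training rows.
unseen_col_constant <- all(x_train[, "1_A"] == 0)

# A made-up tree rule "Prof > 0.5": the new row is 0 in every observed
# column, so it always falls on the "not this level" side of such rules.
goes_right <- x_new[, "Prof"] > 0.5

# Assumption: a two-level factor is stored as a single 0/1 column.
two_level <- function(f) as.integer(f == levels(f)[2])
sex_new <- factor("Other", levels = c("Female", "Male", "Other"))
# Under that assumption, the new level gets the same code as "Female".
sex_code <- two_level(sex_new)
```

Under these assumptions, a new level of a factor with three or more levels is sorted as "none of the observed levels", whereas a new level of a two-level factor would be silently treated as the first level.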
Great, thanks for the reply! I have a follow-up clarification question: Do the posterior predictive samples in predict(bartFit_morelevels,newdata = test_dat, type="ppd") for observations with a new level for covariate rank come from the in-sample variation of c("AsstProf", "AssocProf", "Prof")?