length of penalty for factors
Hi,
when there is a factor as a predictor in a brms model and a penalty vector is supplied to cv_varsel(), the variable selection seems to have issues with the length of the penalty vector; this behavior started after the latest brms update (see session info below).
# from the tutorial
library(projpred)
library(brms)
data('df_gaussian', package = 'projpred')
split_structure <- break_up_matrix_term(y ~ x, data = df_gaussian)
df_gaussian <- split_structure$data
formula <- split_structure$formula
d <- df_gaussian
n <- nrow(df_gaussian) # 100
D <- ncol(df_gaussian[, -1]) # 20
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(D-p0) * 1/sqrt(n) # scale for tau (notice that stan_glm will automatically scale this by sigma)
# Make a factor out of one variable <--------------------
df_gaussian$x20 <- as.factor(sample(0:1, nrow(df_gaussian), replace = TRUE))
fit <- brm(formula, family=gaussian(), data=df_gaussian,
prior=prior(horseshoe(scale_global = tau0, scale_slab = 1), class=b),
seed=1, chains=2, iter=500)
# make penalty
betas <- grep("^b", parnames(fit), value=TRUE)[-1]
penalty <- sample(0:1, length(betas), replace = TRUE)
cvs <- cv_varsel(fit, method = "L1", penalty = penalty)
# >
# [1] "Computing LOOs..."
# Error in glm_elnet(d_train$x, mu, family, lambda_min_ratio = opt$lambda_min_ratio, :
# Incorrect length of penalty vector (should be 21).
# making the penalty one longer
cvs <- cv_varsel(fit, method = "L1", penalty = c(penalty,1))
# >
# warning: solve(): system seems singular; attempting approx solution
Session Info
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
[3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=German_Switzerland.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] brms_2.14.4 Rcpp_1.0.5 projpred_2.0.2
As you noticed at the end, you currently have to provide a penalty value for the intercept as well. However, that seems to result in a singular system, maybe because of the added penalty. I would add the intercept penalty at the beginning of the vector, but instead of adding 1 I would add something like 1e-6, so that you don't actually penalise the intercept while the length still matches what the underlying routine expects.
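For example, something along these lines (untested on my side, just to illustrate; penalty is the same vector as above):

cvs <- cv_varsel(fit, method = "L1", penalty = c(1e-6, penalty))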
Can you try and report back?
Quickest response ever ;) Tried it - it works (here we go):

cvs <- cv_varsel(fit, method = "L1", penalty = c(1e-6, penalty))

Out of curiosity: we ran the intercept-less code approx. one month ago without errors. After the latest brms update (I believe) we encountered the error. Any chance you can elaborate on the internal workings of this?
Great!
In the past month or so we have drastically updated the whole codebase, and we always include the intercept now. That means you can fit an intercept-less brms model (although that is generally not recommended) but projpred still expects an intercept, which may be why it now fails if you don't add an explicit penalty term for it. I can probably include it automatically when you call varsel, because the intercept cannot actually be penalised.
Just commenting here that the underlying glm_elnet function does not expect a penalty for the intercept, so the length of the penalty vector should equal the number of columns in the x that this function receives. It might be a bug somewhere else. Or did you, @AlejandroCatalina, introduce penalization for the intercept as well in version 2.0?
No, I didn't mean that we should include penalisation for the intercept. What happens internally is that we introduce the intercept as an explicit parameter, and therefore glm_elnet believes there are D + 1 coefficients (excluding the intercept). That's why it expects a penalty vector of length D + 1. I will work around it internally so that the user does not need to specify anything for the intercept.
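Just to sketch what I have in mind (this is not the actual projpred code, and pad_penalty is a made-up name for illustration): if the supplied penalty has length D, we could pad it internally to the D + 1 length that glm_elnet currently sees, using a negligible value so that the intercept column stays effectively unpenalised:

# Hypothetical helper, not actual projpred internals: pad a user-supplied
# penalty vector so its length matches the D + 1 columns (intercept included)
# of the design matrix that glm_elnet currently receives.
pad_penalty <- function(penalty, n_coefs) {
  if (length(penalty) == n_coefs - 1) {
    # prepend a negligible penalty for the explicit intercept column
    penalty <- c(1e-6, penalty)
  }
  stopifnot(length(penalty) == n_coefs)
  penalty
}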
Ok, I see. It would also make sense to me for this to happen automatically inside projpred so that the user wouldn't need to do it. One question though: was there a reason to do it this way (that is, augmenting the matrix x with a column of ones) instead of using the glm_elnet argument intercept = TRUE?
There is no specific reason; I believe at that point it made sense to stay close to the formula syntax, which implicitly includes the intercept, so that when you evaluate model.matrix or model.frame the column of 1s is included automatically.
Nonetheless, there is another situation that may extend the vector of coefficients to fit: when there are factor predictors, they are internally broken up into their contrasts, so we actually fit more coefficients than there are predictors originally present. All of this should of course be handled internally; I'll write a TODO item to address this more elegantly.
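For anyone following along, here is a small illustration (using the df_gaussian data and the x1/x20 names from the reproducible example above) of why the lengths diverge: the design matrix gets a column of 1s for the intercept plus one contrast column per non-reference factor level, so it can have more columns than there are predictor terms in the formula.

# Illustration only: with one continuous predictor and one two-level factor,
# the design matrix has 3 columns (intercept + x1 + one contrast for x20),
# even though the formula contains only 2 predictor terms.
X <- model.matrix(~ x1 + x20, data = df_gaussian)
colnames(X)  # "(Intercept)" "x1" "x201"
ncol(X)      # 3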