length of penalty for factors
Hi,
when there is a factor as a predictor in a brms model and a penalty vector is supplied to cv_varsel(), the variable selection seems to have issues with the length of the penalty vector; this behavior started after the latest brms update (see session info below).
# from the tutorial
library(projpred)
library(brms)
data('df_gaussian', package = 'projpred')
split_structure <- break_up_matrix_term(y ~ x, data = df_gaussian)
df_gaussian <- split_structure$data
formula <- split_structure$formula
d <- df_gaussian
n <- nrow(df_gaussian) # 100
D <- ncol(df_gaussian[, -1]) # 20
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(D-p0) * 1/sqrt(n) # scale for tau (notice that stan_glm will automatically scale this by sigma)
# Make a factor out of one variable <--------------------
df_gaussian$x20 <- as.factor(sample(0:1, nrow(df_gaussian), replace = TRUE))
fit <- brm(formula, family=gaussian(), data=df_gaussian,
prior=prior(horseshoe(scale_global = tau0, scale_slab = 1), class=b),
seed=1, chains=2, iter=500)
# make penalty
betas <- grep("^b", parnames(fit), value=TRUE)[-1]
penalty <- sample(0:1, length(betas), replace = TRUE)
cvs <- cv_varsel(fit, method = "L1", penalty = penalty)
# >
# [1] "Computing LOOs..."
# Error in glm_elnet(d_train$x, mu, family, lambda_min_ratio = opt$lambda_min_ratio, :
# Incorrect length of penalty vector (should be 21).
# making the penalty one longer
cvs <- cv_varsel(fit, method = "L1", penalty = c(penalty,1))
# >
# warning: solve(): system seems singular; attempting approx solution
Session Info
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
[3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=German_Switzerland.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] brms_2.14.4 Rcpp_1.0.5 projpred_2.0.2
As you noticed at the end, you currently have to provide a penalty value for the intercept as well. However, that seems to result in a singular system, maybe because of the added penalty. I would add the intercept penalty at the beginning of the vector, but instead of adding 1 I would add something like 1e-6, so that you don't actually penalise the intercept while the length still matches what the underlying routine expects.
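For example, something along these lines (untested on my side, just to illustrate; penalty is the same vector as above):

cvs <- cv_varsel(fit, method = "L1", penalty = c(1e-6, penalty))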
Can you try and report back?
Quickest response ever ;) Tried it - it works (here we go):

cvs <- cv_varsel(fit, method = "L1", penalty = c(1e-6, penalty))

Out of curiosity: we ran the intercept-less code approx. one month ago without errors. After the latest brms update (I believe) we encountered the error. Any chance you can elaborate on the internal workings of this?
Great!
In the past month or so we have drastically updated the whole codebase, and we always include the intercept now. That means you can fit an intercept-less brms model (although that is generally not recommended) but projpred still expects an intercept, which may be why it now fails if you don't add an explicit penalty term for it. I can probably include it automatically when you call varsel, because the intercept cannot actually be penalised.
Just commenting here that the underlying glm_elnet function does not expect a penalty for the intercept, so the length of the penalty vector should equal the number of columns in the x that this function receives. It might be a bug somewhere else. Or did you, @AlejandroCatalina, introduce penalization for the intercept as well in version 2.0?
No, I didn't mean that we should include penalisation for the intercept. What happens internally is that we introduce the intercept as an explicit parameter, and therefore glm_elnet believes there are D + 1 coefficients (excluding the intercept). That's why it expects a penalty vector of length D + 1. I will work around it internally so that the user does not need to specify anything for the intercept.
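Just to sketch what I have in mind (this is not the actual projpred code, and pad_penalty is a made-up name for illustration): if the supplied penalty has length D, we could pad it internally to the D + 1 length that glm_elnet currently sees, using a negligible value so that the intercept column stays effectively unpenalised:

# Hypothetical helper, not actual projpred internals: pad a user-supplied
# penalty vector so its length matches the D + 1 columns (intercept included)
# of the design matrix that glm_elnet currently receives.
pad_penalty <- function(penalty, n_coefs) {
  if (length(penalty) == n_coefs - 1) {
    # prepend a negligible penalty for the explicit intercept column
    penalty <- c(1e-6, penalty)
  }
  stopifnot(length(penalty) == n_coefs)
  penalty
}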
Ok, I see. It would also make sense to me for this to happen automatically inside projpred so that the user wouldn't need to do it. One question though: was there a reason to do it this way (that is, augmenting the matrix x with a column of ones) instead of using the glm_elnet argument intercept = TRUE?
There is no specific reason; I believe at that point it made sense to stay close to the formula syntax, which implicitly includes the intercept, so that when you evaluate model.matrix or model.frame the column of 1s is included automatically.
Nonetheless, there is another situation that may extend the vector of coefficients to fit: when there are factor predictors, they are internally broken up into their contrasts, so we actually fit more coefficients than there are predictors originally present. All of this should of course be handled internally; I'll write a TODO item to address this more elegantly.
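For anyone following along, here is a small illustration (using the df_gaussian data and the x1/x20 names from the reproducible example above) of why the lengths diverge: the design matrix gets a column of 1s for the intercept plus one contrast column per non-reference factor level, so it can have more columns than there are predictor terms in the formula.

# Illustration only: with one continuous predictor and one two-level factor,
# the design matrix has 3 columns (intercept + x1 + one contrast for x20),
# even though the formula contains only 2 predictor terms.
X <- model.matrix(~ x1 + x20, data = df_gaussian)
colnames(X)  # "(Intercept)" "x1" "x201"
ncol(X)      # 3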