Semi-partial and inclusive R2
Just saw this preprint that cites performance:
https://www.biorxiv.org/content/biorxiv/early/2020/07/26/2020.07.26.221168.full.pdf
Here, we introduce partR2, an R package that quantifies part R2 for fixed effect predictors based on (generalized) linear mixed-effect model fits. The package iteratively removes predictors of interest and monitors the change in R2 as a measure of the amount of variance explained uniquely by a particular predictor or a set of predictors. partR2 also estimates structure coefficients as the correlation between a predictor and fitted values, which provide an estimate of the total contribution of a fixed effect to the overall prediction, independent of other predictors. Structure coefficients are converted to the total variance explained by a predictor, termed ‘inclusive’ R2, as the square of the structure coefficients times total R2.
https://github.com/mastoffel/partR2
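For a plain lm() fit, the two quantities described there can be sketched as follows (illustrative only, not the partR2 implementation, which targets lme4 mixed models and adds parametric bootstrapping; the model and predictors are arbitrary):

```r
# Sketch of 'part R2' and 'inclusive R2' for a plain lm() fit
full <- lm(mpg ~ wt + hp + drat, data = mtcars)
r2_full <- summary(full)$r.squared

# Part R2 for 'hp': the drop in R2 when 'hp' is removed
reduced <- update(full, . ~ . - hp)
part_r2_hp <- r2_full - summary(reduced)$r.squared

# Structure coefficient for 'hp': correlation between predictor and fitted values
sc_hp <- cor(mtcars$hp, fitted(full))

# 'Inclusive' R2: squared structure coefficient times total R2
inclusive_r2_hp <- sc_hp^2 * r2_full
```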
Wouldn't this rather be something for effectsize? @mattansb ?
True, since it's parameters-specific, if I understand correctly.
This is really interesting! @DominiqueMakowski isn't this really what would solve #127 ?
Mmh, possibly, didn't fully read it yet... I guess it could be a very good candidate if it works with Bayesian models and, cherry on top, with other GLMs like logistic models... but indeed the interpretation would be relatively straightforward.
partR2 takes a fitted (generalized) linear mixed-model (GLMM), from the popular mixed model package lme4 (Bates et al., 2015) and estimates part R2 by iterative removal of fixed effects (Nimon et al., 2008). [...] The central function partR2 will work for Gaussian, Poisson and binomial GLMMs. Since the model fit is done externally, there is no need to supply a family argument. For non-Gaussian GLMMs, the package estimates link-scale R2 (sensu Nakagawa & Schielzeth, 2013).
Sounds awesome. And the code looks okay, i.e., no obscure C scripts or some heavy mandatory dependencies.
Maybe we could contact its author and see if he's okay with us (re)implementing the method here (correctly referenced, of course)? We could try reimplementing it without the tidyverse functions, and hopefully extend it to other packages (glmmTMB, stanreg, etc.)
Based on their example (https://github.com/mastoffel/partR2#example), it seems that their function returns the part R2 for all the combinations of fixed effects, but we could by default return the R2 of the parameters present in the model (so that it's more focused).
If I understand correctly, this requires re-fitting the model for each parameter / term. So very costly, and not very practical for Bayesian models.
Also, this sounds (haven't read the whole thing yet) very close, conceptually, to a type-III Eta square (not partial), no?
If we go forward with this, I think contacting the author is a good idea.
So very costly, and not very practical for Bayesian models
bummer! Maybe a solution could be found with projpred, which as far as I understand is specialized in this kind of stuff (though it's outside of my/our scope probably).
Also, this sounds (haven't read the whole thing yet) very close, conceptually, to a type-III Eta square (not partial), no?
🤷‍♂️ why not partial tho?
Maybe a solution could be found with projpred
This also sounds interesting! Stop adding to my reading list!
why not partial tho?
Because the method described in the partR2 paper is akin to a semi-partial (part) correlation - the relation between the predictor (controlling for other predictors) and the outcome (not controlling for other predictors),
which is closer to (but is not identical to) eta2 = SSeffect / SStotal than to etap2 = SSeffect / (SSeffect + SSerror).
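To make this concrete for a plain lm(): a sketch with an arbitrary model; for a single continuous term entered last, the sr2 and the last-entry (Type-III-style) eta2 coincide, while the partial eta2 is larger (with factors or other SS types they generally differ):

```r
# Compare sr2 (delta R2) with eta2 and partial eta2 for 'hp' entered last
m_full <- lm(mpg ~ wt + hp, data = mtcars)
m_red  <- lm(mpg ~ wt, data = mtcars)

# Semi-partial (part) R2: change in R2 when hp is added last
sr2_hp <- summary(m_full)$r.squared - summary(m_red)$r.squared

ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ss_error <- sum(residuals(m_full)^2)
ss_hp    <- sum(residuals(m_red)^2) - ss_error  # SS for hp entered last

eta2_hp  <- ss_hp / ss_total            # eta2 (not partial)
etap2_hp <- ss_hp / (ss_hp + ss_error)  # partial eta2
```

Here sr2_hp equals eta2_hp, while etap2_hp comes out larger because its denominator drops the variance explained by the other predictor.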
@bwiernik Do you have any input on this?
It should be relatively straightforward to implement spR2 for basic lm() models, no? We might at least start there..
They appear to have reinvented dominance analysis, which computes the average squared semipartial correlation (change in R2) when adding a predictor to a model (or conditionally when specific other predictors are in the model).
I think we should implement this to report deltaR2() and semipartialR() in effectsize or performance (or both) for just the variables as last entered into the model. For this case, we might require users to specify a predictor to do the refitting for, rather than doing all variables, for computational reasons. We could provide an "all" option or document find_predictors(model) as a way to get everything, with a caution that this might be time consuming. We could also allow the variables to be given as a list of variable combinations to enter or remove as a block.
You can look at the yhat and dominanceanalysis packages for implementation of dominance analysis. That should go into parameters I think. https://github.com/easystats/parameters/issues/479
A wrinkle in dominance analysis is that models with interactions or polynomial terms should not have higher-order terms without the lower-order ones.
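For concreteness, general dominance for a small lm() can be brute-forced like this (a sketch with an arbitrary model; the averaging over subset sizes is the point, use domir or yhat for real work):

```r
# Brute-force general dominance for a small lm(): for each predictor,
# average its incremental R2 within each subset size of the other
# predictors, then average those conditional means across sizes.
preds <- c("wt", "hp", "drat")
r2_of <- function(vars) {
  f <- if (length(vars)) reformulate(vars, response = "mpg") else mpg ~ 1
  summary(lm(f, data = mtcars))$r.squared
}
general_dominance <- function(target) {
  others <- setdiff(preds, target)
  inc_at_size <- function(k) {
    subs <- if (k == 0) list(character(0)) else combn(others, k, simplify = FALSE)
    mean(vapply(subs, function(s) r2_of(c(s, target)) - r2_of(s), numeric(1)))
  }
  mean(vapply(0:length(others), inc_at_size, numeric(1)))
}
gd <- vapply(preds, general_dominance, numeric(1))
# The general dominance statistics sum to the full-model R2
all.equal(sum(gd), r2_of(preds))
```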
Here is a thesis on dominance analysis in Bayesian models. Not sure if it gives any ideas on increasing efficiency. It seems like there ought to be some efficiency improvements that could be realized ala PSIS-LOO-IC. https://www.academia.edu/7062364/Dominance_Analysis_Utilizing_the_Bayesian_Framework_in_Linear_and_Logistic_Regression_Models
For large or Bayesian models, running a random subset of the possible models seems more reasonable, cf. http://www.metafor-project.org/doku.php/plots:gosh_plot
See also Shou, Y., & Smithson, M. (2015). Evaluating Predictors of Dispersion: A Comparison of Dominance Analysis and Bayesian Model Averaging. Psychometrika, 80(1), 236-256.
For intervals on deltaR2
https://doi.org/10.1177/00131640121971392
https://doi.org/10.1177/0013164406292030
https://doi.org/10.1037/1082-989x.4.1.70
https://doi.org/10.1177/0013164407313371
https://doi.org/10.22237/jmasm/1209614460
https://doi.org/10.1177/0013164410379335
https://doi.org/10.1007/s11336-04-1221-6
Really sounds like you want this issue assigned to you, B (: Care to have a go? (Make some initial rough PR, and then we can collab on the fine-tuning?) (There is no rush at all to get this done)
sure
Wait, are you telling me that you didn't run down a 50 paper rabbit hole about confidence intervals for R-squared when you prepped your intro regression class teaching materials?
I need this badly. In my line of research, as an effect size, we need to report the sr2 (semi-partial correlation squared, also known as the delta R-square): the variance uniquely explained by a given regression term.
So I have implemented it in my rempsyc package, but at the cost of adding a huge (my biggest) dependency, for a single function (car::Anova), simply to obtain this effect size. It was originally in lmSupport::modelEffectSizes. So I realized I could probably get it from easystats::effectsize (already a dependency), but I see it is not implemented yet. Would be very nice!! Bonus points if it would have the confidence interval too!
You are more than welcome to add it here (:
If I understand correctly, this requires re-fitting the model for each parameter / term. So very costly, and not very practical
Is the best way to calculate it by hand to fit the full model, get the R2, then remove one term at a time, get the R2 each time, and subtract those from the original R2? Or is there a more efficient way? I guess for very large models it could be slow, but if there's no other way, then surely that should be how other packages calculate it.
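Concretely, I mean something like this (a sketch with an arbitrary model):

```r
# Delta R2 / sr2 by refitting: fit the full model, then drop one term
# at a time and subtract the reduced R2 from the full R2.
model <- lm(mpg ~ wt + hp + drat, data = mtcars)
r2_full <- summary(model)$r.squared
term_labels <- attr(terms(model), "term.labels")

delta_r2 <- vapply(term_labels, function(tm) {
  reduced <- update(model, as.formula(paste(". ~ . -", tm)))
  r2_full - summary(reduced)$r.squared
}, numeric(1))
delta_r2  # named vector: unique variance for each term
```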
This is just dominance analysis--we've already got that in parameters.
Oh dang… I got carried away and submitted a PR already…
That said, I just tried parameters::dominance_analysis, but it seems like it doesn't support interactions…
# Testing parameters
library(parameters)
m <- lm(mpg ~ cyl + disp + hp * drat, data = mtcars)
dominance_analysis(m)
#> Error: Interactions in the model formula are not allowed.
Created on 2022-10-02 with reprex v2.0.2
From the pkgdown site:
The input model is parsed using insight::find_predictors(), does not yet support interactions
@rempsyc I'm curious about the (non-)overlap to dominance_analysis().
- Could you please post an example (w/o interaction) and see if results between your PR and dominance_analysis() are the same?
- Would it be possible / make sense to integrate your PR into dominance_analysis(), to make that function work with interactions?
Sure. Since they don’t have the same names for the same concepts, I’m unsure of which values to extract. But let’s start with the new one, then let’s look for those same values in dominance_analysis.
devtools::load_all(".")
model <- lm(vs ~ cyl + carb + mpg, data = mtcars)
a <- sr2(model)
a
#> Parameter sr2
#> 1 (Intercept) NA
#> 2 cyl 0.193396279
#> 3 carb 0.033902703
#> 4 mpg 0.008503627
b <- parameters::dominance_analysis(model)
b
#> # Dominance Analysis Results
#>
#> Model R2 Value: 0.694
#>
#> General Dominance Statistics
#>
#> Parameter | General Dominance | Percent | Ranks | Subset
#> ------------------------------------------------------------
#> (Intercept) | | | | constant
#> cyl | 0.380 | 0.548 | 1 | cyl
#> carb | 0.134 | 0.193 | 3 | carb
#> mpg | 0.180 | 0.259 | 2 | mpg
#>
#> Conditional Dominance Statistics
#>
#> Subset | IVs: 1 | IVs: 2 | IVs: 3
#> ---------------------------------
#> cyl | 0.657 | 0.290 | 0.193
#> carb | 0.324 | 0.044 | 0.034
#> mpg | 0.441 | 0.089 | 0.009
#>
#> Complete Dominance Designations
#>
#> Subset | < cyl | < carb | < mpg
#> -------------------------------
#> cyl | | FALSE | FALSE
#> carb | TRUE | |
#> mpg | TRUE | |
We’re looking for the sr2 values .19, .03, and .01. So it seems like they correspond to the IVs: 3 column.
a$sr2[-1] == b$Conditional$IVs_3
#> [1] TRUE TRUE TRUE
Created on 2022-10-02 with reprex v2.0.2
As to your second point: maybe, but I'm not convinced. Here's why.
First, I had never heard of dominance analysis before, so I'm not sure it's a very well-known term. In my network, everyone uses semi-partial correlation squared, and most don't even know it's the same as delta R2. So there might be a discoverability issue. Like, you're looking for an effect size, so you expect to find it in the package of the same name. But instead, you have to dig into parameters, discover that the right function is called dominance_analysis, and then guess that what you're looking for (the sr2) is actually called IVs: 3. These terms are not (currently) in the documentation, so they're hard to find, I'd say.
Second, the output from dominance analysis is pretty comprehensive, it's like for a full analysis/report. My vision for the sr2 function was like for the other effect size functions: it just gives you precisely what you asked. That is also more useful for programming. In this context, that's fine that there's some overlap IMO.
In this context, that's fine that there's some overlap IMO
Sounds reasonable. WDYT @bwiernik ?
Second, the output from dominance analysis is pretty comprehensive, it's like for a full analysis/report. My vision for the sr2 function was like for the other effect size functions: it just gives you precisely what you asked. That is also more useful for programming. In this context, that's fine that there's some overlap IMO.
There are arguments to report just the general, conditional, or complete dominance stats alone.
In this context, that's fine that there's some overlap IMO
Sounds reasonable. WDYT @bwiernik ?
Mostly I was just looking to minimize redundant code. Fine with an alias to return just the effect size, but we should try to reduce duplicated code.
An ideal scenario might indeed be an alias like bwiernik suggests, something like:
sr2 <- function(model) {
y <- parameters::dominance_analysis(model, complete = FALSE)
y$Conditional$IVs_3
}
model <- lm(vs ~ cyl + carb + mpg, data = mtcars)
sr2(model)
#> [1] 0.193396279 0.033902703 0.008503627
Created on 2022-10-02 with reprex v2.0.2
Three thoughts though.
- One, it seems like the name of the column (e.g., IVs_3) changes depending on the number of terms, so we would need to account for that.
- Second, I don't immediately see how to ask dominance analysis to compute only the sr2 (the best I could do was set the complete argument to FALSE; is there an argument to set general to FALSE?). Since I have to make it compute a bunch of other things and then subset, it might be less efficient than a dedicated function.
- Third, and more importantly, I don't know why it doesn't support interactions, but there might be a good reason, such as being complicated (looking at the code, it looks like it could be). The lack of support for interactions could be due to specific aspects of dominance analysis, or to some of the other outputs, or to non-lm models, which I have no knowledge of, and which would make it harder to solve. Whereas the other function is immediately available and functional. If I knew how to easily add support for interactions in dominance_analysis, though, I might do that.
I wonder to what extent and at what level dominance analysis and the delta R2 relate. From the convo, it seems like dominance analysis is based off of it and then does other stuff?
If that's the case, maybe we could have delta R2 implemented and exported, and then have dominance analysis reuse it and expand on it? Rather than having delta R2 reuse dominance analysis (as dominance analysis seems to be the higher-order application). So that we minimize code duplication and maintenance burden.
Also, naive question, does delta R2 work for Bayesian models?
@jluchman
Also, naive question, does delta R2 work for Bayesian models?
The current implementation does not allow it. I'm not sure how it would work; priors might get messy when fitting sub-models...
I wonder to what extent and at what level dominance analysis and the delta R2 relate. From the convo, it seems like dominance analysis is based off of it and then does other stuff?
As was noted above, the semi-partial correlation (for a linear regression with explained variance R2) is the last/3rd column of results from the 'Conditional Dominance Statistics' in the example. In fact, it will always be the last column in the matrix, irrespective of the number of predictors; of course, which column that is will change (but that's not a hard programming problem to solve!).
If that's the case, maybe we could have delta R2 implemented and exported, and then have dominance analysis reuse it and expand on it? Rather than having delta R2 reuse dominance analysis (as dominance analysis seems to be the higher-order application). So that we minimize code duplication and maintenance burden.
Dominance analysis will have a lot of overhead to compute the semi-partial correlations, and it increases with the number of predictors, as it is an 'all subsets model' approach: all 2^p combinations of models are run, where p is the number of predictors (see this vignette for a discussion).
I don't recommend wrapping dominance_analysis to get the semi-partial correlation as a result. A separate function may increase the maintenance burden, but the decrease in performance from using dominance_analysis for this purpose is, IMO, not worth it.
Worth noting that the pratt method in relaimpo::calc.relimp, although not the semi-partial correlation, is a potentially faster way to get a value like it (and like the General Dominance Statistics) than mimicking what dominance_analysis does. Again though, it is not the same value.
Also, naive question, does delta R2 work for Bayesian models?
There has actually been some work I'm familiar with to extend importance into the Bayesian realm with Bayes factors. See this and this.
It's worth noting that what a semi-partial vs. dominance statistic "is", and how well it reflects importance, is a topic that has been covered extensively in the literature (see this). Also worth noting that the semi-partial correlation does not partition the R2 into variance accounted for by a specific predictor, but is one part of doing so using the dominance analysis/Shapley value decomposition method. Not to say it's wrong or not useful as a metric, but, as the linked article notes, it presents conceptual problems when inferring 'importance' from those decompositions. Those issues are only truly relevant if such 'importance' inferences are being made, and that may not be the case in the typical usage of the semi-partial correlation metric as intended here.
@fmg-jluchman how much work would it be approximately to add support for interactions and confidence intervals to parameters::dominance_analysis (or your domir package if it depends on that)?
Performance wise: I tend to agree with @fmg-jluchman - I think it would be best to have this as a separate function, and for the time being limit it to lm() models - fitting subsets of these should be relatively fast and easy to do.
The comparable step in dominance analysis is the conditional step with k - 1 terms. That's the incremental R2 when adding the term to all the other terms. We should add an argument to perform just specific conditional steps to the dominance analysis function.
It's not clear to me what semipartial correlations should be reported in a model with interactions--if I have a + b + a:b, should the product term be included in the base model when computing incremental validity for a?