R^2 measures decomposition?
There are several packages out there, like rego for Stata, that provide a Shapley/Owen decomposition for R^2 and R^2-type (adjusted etc.) goodness-of-fit measures.
See this paper for example: https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-6/issue-none/Axiomatic-arguments-for-decomposing-goodness-of-fit-according-to-Shapley/10.1214/12-EJS710.full
and rego package page: http://www.marco-sunder.de/stata/rego.html
Related: https://github.com/easystats/parameters/issues/479
Acknowledging @bwiernik's concern about importance metrics and their use- and situation-specific interpretation: if the performance package authors are potentially interested in Shapley and Owen decomposition, I have a relatively new package I have been working on (under my other alias @jluchman here: https://github.com/jluchman/domir) and would be happy to share/adapt it to work within the easystats suite of packages.
Technically, domir is a dominance analysis package (as its focus is on dominance designations), but the general dominance statistics it produces are equivalent to Shapley values (and to an Owen decomposition when using the sets argument). This package's goal is to move beyond linear regression (the focus of yhat and relaimpo) and apply Shapley value decomposition and dominance analysis to any statistical or machine learning model's fit to data (and hence it could use any/all of the metrics offered by performance).
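To make the equivalence concrete, here is a minimal sketch (not domir's implementation) of how a general dominance statistic can be computed: for each predictor, average its marginal gain in fit within each subset size, then average across sizes, which yields the Shapley value. The function name `general_dominance`, the predictor names, and the R^2 values in `toy_r2` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb


def general_dominance(predictors, fit):
    """Shapley-style decomposition of a fit statistic.

    `fit` maps a frozenset of predictor names to a fit value (e.g. R^2);
    fit(frozenset()) is the value for the model with no predictors.
    """
    n = len(predictors)
    shares = {}
    for p in predictors:
        others = [q for q in predictors if q != p]
        total = 0.0
        for size in range(n):
            # Average gain from adding p to every subset of this size.
            gains = [fit(frozenset(s) | {p}) - fit(frozenset(s))
                     for s in combinations(others, size)]
            total += sum(gains) / comb(n - 1, size)
        # Average the per-size averages across all subset sizes.
        shares[p] = total / n
    return shares


# Hypothetical R^2 values for every submodel of three predictors.
toy_r2 = {
    frozenset(): 0.00,
    frozenset({"x1"}): 0.30,
    frozenset({"x2"}): 0.20,
    frozenset({"x3"}): 0.10,
    frozenset({"x1", "x2"}): 0.40,
    frozenset({"x1", "x3"}): 0.35,
    frozenset({"x2", "x3"}): 0.25,
    frozenset({"x1", "x2", "x3"}): 0.45,
}

shares = general_dominance(["x1", "x2", "x3"], toy_r2.__getitem__)
```

The shares sum to the full model's fit (0.45 in this toy example), which is the additivity property that makes the decomposition attractive. Because `fit` is just a callable, any fit metric could be swapped in, at the cost of evaluating 2^p submodels for p predictors.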
The package I humbly offer for consideration is still very much in development, but, if you are interested, I would be happy to collaborate with the authors here to pull/wrap it in and to add the caveats needed to contextualize the kinds of insights it provides about a model's results.
I would love to have dominance functions in performance! Wrappers for your package would be great.
Pleased to hear it! Look forward to collaborating to bring this in - will start on a fork
That sounds great indeed :) Looking forward to that!
From a design standpoint, I don't know what you have in mind for bringing easystats and domir closer together. However, since your package already looks functional and well documented, I could imagine several possible scenarios:
- performance wraps around domir and uses it for some of its functions (domir becomes a soft dependency of performance).
- some of the heavy lifting of domir gets transposed into performance (i.e., the core computing functions for dominance analysis are shared between domir and performance). This would allow leveraging the rest of the ecosystem (in particular insight) to easily extend and expand. This means that performance becomes a dependency of domir, which would still have its place as a dedicated package with a specific focus, documentation and other bespoke functions for printing/plotting etc.
- domir gets (mostly) "absorbed" within easystats/performance. You'd become the leader of the dominance analysis aspect of easystats, and we could have dedicated vignettes and blogposts covering this set of functions.
Perhaps it's still a bit too soon to talk about how to actually integrate stuff, so no worries if you prefer to be on a "we'll see how it goes" line ☺️
Good question, and I have been putting some thought into this.
My initial thought was probably most similar to the first scenario, but, agreed, it may make more sense as the integration proceeds to shift to a more transposed or "absorptive" approach depending on the details :smiley:.
Moving forward, if I have development or design questions I'd like your collective input on, would you all prefer to continue discussing them in this issue thread?
Sure, this thread is a good spot.
Hi All,
I have a candidate initial version of a dominance analysis function that I could submit as a pull request, but I thought it might be better to discuss it here first and refine as needed. The repo I'm referring to throughout is: https://github.com/jluchman/performance
The added function is called dominance_analysis() and attempts to follow the usage of r2() in that it accepts a model object directly and works behind the scenes to parse the input and produce a result. The approach taken is based on the "domir as soft dependency" idea: domir has been added to the "Suggests" list, and the function leans on insight to pull out the formula, predictors, and data from the model object, formats them in the way domir::domin expects, and uses performance::r2 to obtain the R2.
Notes/Comments
- Limited number of models are supported at present
Those supported are mostly those that have a single outcome and one predictive equation (e.g., lm, glm, polr, multinom, survreg). Those which are not supported should throw an error; such errors are built in if a model is not supported by insight or performance::r2, or is in a model class that is not currently dominance analyzable in a general way in domin. A list of these is outlined in the documentation. There are others that may cause more cryptic errors (e.g., svyglm), and I've tried to explain why such models might fail (e.g., no data argument in svyglm), but I am not familiar with all the models performance supports and would surely miss some specifics.
- All predictors in the model object are used as dominance analysis factors
The current build does not allow for grouping of predictors (i.e., like Owen decomposition; sets argument in domin) or removal of covariates (all argument in domin). These could be built in but, in my experience, most researchers/analysts use all predictors as dominance analysis factors, so these are features which could be added in later revisions once other bugs are worked out.
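For readers unfamiliar with grouping, a rough sketch of what a sets-style analysis computes: each set of predictors is treated as a single player, and the value of a coalition of sets is the fit of the model containing all of their predictors. This is not domir's code; `grouped_shares`, the set labels, and the R^2 values are hypothetical, and a full Owen decomposition would additionally split each set's share among its members, which is not shown here.

```python
from itertools import permutations


def grouped_shares(sets, fit):
    """Shapley-style shares where each set of predictors enters as one unit.

    `sets` maps a set label to the predictors it contains; `fit` maps a
    frozenset of predictor names to a fit value such as R^2.
    """
    labels = list(sets)
    orders = list(permutations(labels))
    shares = {label: 0.0 for label in labels}
    for order in orders:
        included = frozenset()
        for label in order:
            # Add the whole set at once and record the gain in fit.
            expanded = included | frozenset(sets[label])
            shares[label] += fit(expanded) - fit(included)
            included = expanded
    # Average each set's gain over all orderings of the sets.
    return {label: total / len(orders) for label, total in shares.items()}


# Hypothetical R^2 values; only unions of whole sets are ever needed.
toy_r2 = {
    frozenset(): 0.00,
    frozenset({"x1", "x2"}): 0.40,
    frozenset({"x3"}): 0.10,
    frozenset({"x1", "x2", "x3"}): 0.45,
}

group_shares = grouped_shares(
    {"set_a": ["x1", "x2"], "set_b": ["x3"]},
    toy_r2.__getitem__,
)
```

Grouping also shrinks the number of submodels to fit, since only combinations of whole sets are evaluated rather than every subset of individual predictors.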
- Simple model formulas only
No in-formula offsets, interactions, or transformations (e.g., log(x)) are allowed. The model also requires the use of a data argument for the formula's terms to refer to.
- R v >= 3.5
domir::domin has only been tested in R versions >= 3.5, and dominance_analysis consistently fails checks in R v 3.4; the failure stems from an incompatibility in domir::domin. I figured it would be simpler to disallow dominance_analysis for R versions < 3.5 and not hold up progress. I have done little to diagnose the issue, but it appears to be triggered within domir::domin by the use of colSums() and would likely need a fix in the underlying dominance analysis computation methods (or so I assume).
I have also added tests and a print method, and adjusted the spelling document for new words. I have been able to get all automated checks that are affected by the additions to pass, based on what I'm seeing on GitHub Actions.
I also plan to add an argument that a user could supply to quote data-masked arguments in the model object (e.g., weights) to ensure they're evaluated at model call time and not as passed over in dominance_analysis; at present they throw an error. Similarly, I may change the way data are called and/or disallow otherwise valid models that do not have a data argument for the initial version.
Again, I'm happy to initiate a pull request, respond to comments, and implement requested edits prior to a pull request as needed. Looking forward to starting to get this function built into performance!
Sure, why don't you open a pull request and I can look at it and comment?
Happy to and will do shortly.