fragile approach to getting names of independent variables?

Open mikoontz opened this issue 3 years ago • 0 comments

Hello! Thanks so much for this package! I'm learning a ton about making inference from random forest models, and I really appreciate the effort you've put into making this more understandable.

I came across an issue when running randomForestExplainer::plot_predict_interaction() on a {ranger} model built using {spatialRF}. It seems that the method used by {randomForestExplainer} to get the list of independent variable names is fragile, and it can error out if the formula syntax wasn't used to create the {ranger} model.

For instance, with {ranger}, you can build a model like this:

forest_ranger <- ranger::ranger(x = mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")], y = mtcars[, "cyl"])

That model then errors out when passed to:

plot_predict_interaction(forest_ranger, mtcars, "mpg", "hp")

But it doesn't error out when building the same model using the formula syntax:

forest_ranger <- ranger::ranger(cyl ~ ., data = mtcars)
plot_predict_interaction(forest_ranger, mtcars, "mpg", "hp")

The issue arises in this line in {randomForestExplainer}: https://github.com/ModelOriented/randomForestExplainer/blob/630c4fe9f7ddcc0a9a586dc4c4fc1822e9d30776/R/min_depth_interactions.R#L363

The {spatialRF} package doesn't build the {ranger} model using the formula syntax, so randomForestExplainer::plot_predict_interaction() won't work on the resulting model:

forest_ranger <- spatialRF::rf(dependent.variable.name = "cyl", 
                               predictor.variable.names = c("mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"), 
                               data = mtcars)
plot_predict_interaction(forest_ranger, mtcars, "mpg", "hp")

I documented this issue and my workaround in the repo for {spatialRF}, but I thought I'd add it here too, since the issue seems more relevant to {randomForestExplainer} and how it captures what the independent variables are in a {ranger} model.

It looks like, in a {ranger} model, you can get the independent variables directly from the $forest$independent.variable.names component? Maybe this is a more robust way to capture that info for plot_predict_interaction()?
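
Here is a minimal sketch of that idea. It assumes the fitted model exposes the predictor names at $forest$independent.variable.names, as suggested above. The helper name get_predictor_names() is hypothetical and not part of {randomForestExplainer}, and the example runs against a stand-in list rather than a real fitted model, so it needs no extra packages:

```r
# Hypothetical helper (not part of {randomForestExplainer}): read the
# predictor names from the fitted forest instead of parsing the model call.
get_predictor_names <- function(forest) {
  # Assumption: the names live in forest$forest$independent.variable.names.
  predictors <- forest$forest$independent.variable.names
  if (is.null(predictors)) {
    stop("No predictor names found at forest$forest$independent.variable.names.")
  }
  predictors
}

# Stand-in object that mimics only the slot used above; a real model
# would come from ranger::ranger() or spatialRF::rf().
mock_forest <- list(
  forest = list(independent.variable.names = c("mpg", "hp", "wt"))
)

get_predictor_names(mock_forest)
```

Because this lookup never touches the model call, it should behave the same whether the model was built with the formula syntax, the x/y syntax, or through {spatialRF}, provided the slot is populated.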

What do you think?

Jul 12 '22 16:07 mikoontz