Update Shapley.R
When creating y.hat.diff$feature.value, the code takes the colnames of x.interest and simply appends them as another column. However, the order of the colnames of x.interest may differ from the order of the same features in y.hat.diff$feature, so a lookup by feature name (vlookup-style) is needed instead of appending the column by position. To do this, auxiliaryTab is created from the feature names of x.interest, and merge is then used to assign the correct feature.value to the corresponding feature.
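To make the difference between appending by position and looking up by name concrete, here is a minimal base R sketch. It is not the code from Shapley.R: the toy data, the phi values, and the assumption that feature.value is a "name=value" label are made up for illustration, and only the names y.hat.diff, x.interest and auxiliaryTab are taken from the description above.

```r
# Hypothetical record to explain, with columns in one order.
x.interest <- data.frame(age = 42, income = 3000, height = 180)

# Hypothetical results table listing the same features in a different order.
y.hat.diff <- data.frame(
  feature = c("income", "height", "age"),
  phi = c(0.5, -0.1, 0.2),
  stringsAsFactors = FALSE
)

# Labels built from x.interest follow x.interest's column order.
labels <- paste(colnames(x.interest), unlist(x.interest[1, ]), sep = "=")

# Appending by position pairs row "income" with the label for "age", and so on.
positional <- cbind(y.hat.diff, feature.value = labels, stringsAsFactors = FALSE)

# Lookup by name: key the labels on the feature name, then merge on that key.
auxiliaryTab <- data.frame(
  feature = colnames(x.interest),
  feature.value = labels,
  stringsAsFactors = FALSE
)
merged <- merge(y.hat.diff, auxiliaryTab, by = "feature", sort = FALSE)

print(positional)
print(merged)
```

In `merged`, every row's feature.value starts with the name in feature, regardless of the column order of x.interest. The row order returned by merge is not guaranteed, so anything downstream should not rely on it.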
Thanks for this pull request. The tests don't pass: it seems that the Shapley values no longer add up to the difference in the test, as they should.
Hi, thank you for your comment. I checked, and it seems that the Shapley values don't add up to the difference even by default when running iml_0.10.1. Might that be an issue in your package? I'm not sure. The code I used to test this is attached. The R session info:
R version 3.6.1 (2019-07-05), Platform: x86_64-w64-mingw32/x64 (64-bit), Running under: Windows 10 x64 (build 18362)
It does add up, but only in expectation, meaning that when you increase the sample.size in Shapley$new, you will get closer to the difference.
The test for Shapley to add up can be found here: https://github.com/christophM/iml/blob/master/tests/testthat/test-Shapley.R
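As a rough illustration of "adds up only in expectation", the following base R sketch estimates Shapley values by generic permutation sampling on a toy model and compares their sum with the difference between the prediction and the average prediction. It does not use iml and is not claimed to match its implementation; the model `f`, the data, and the helper `shapley_sum` are hypothetical.

```r
set.seed(1)
n <- 500
p <- 3
X <- matrix(rnorm(n * p), ncol = p)                # toy background data
f <- function(v) 2 * v[1] - v[2] + v[1] * v[3]     # hypothetical model
x <- X[1, ]                                        # instance to explain

# Quantity the Shapley values should sum to: f(x) minus the mean prediction.
target <- f(x) - mean(apply(X, 1, f))

# Sum of sampled Shapley estimates, using m Monte Carlo draws per feature.
shapley_sum <- function(m) {
  phi <- numeric(p)
  for (j in seq_len(p)) {
    contrib <- numeric(m)
    for (i in seq_len(m)) {
      z <- X[sample(n, 1), ]                       # random background instance
      perm <- sample(p)                            # random feature order
      before <- perm[seq_len(which(perm == j) - 1)]
      x_minus <- z
      x_minus[before] <- x[before]                 # features before j come from x
      x_plus <- x_minus
      x_plus[j] <- x[j]                            # additionally take j from x
      contrib[i] <- f(x_plus) - f(x_minus)
    }
    phi[j] <- mean(contrib)
  }
  sum(phi)
}

gap_small <- abs(shapley_sum(20) - target)
gap_large <- abs(shapley_sum(5000) - target)
cat("gap with 20 samples:  ", gap_small, "\n")
cat("gap with 5000 samples:", gap_large, "\n")
```

For any single run the smaller sample can happen to land closer, but the gap shrinks on average as the number of samples grows, which is the behaviour described above.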
Thanks, I will have to look into that more deeply. For my data they do not add up and the gap is quite large: with sample.size = 3000, the actual difference is more than twice the sum of the Shapley values.
Anyway, this request is meant to tackle a different issue: when the order of columns in the training data (predictor$data$X) is not the same as in the record to explain (x.interest), the result is misleading. The feature and feature.value columns of shapley$results disagree, i.e. within a single row the feature named in feature is not the one named in feature.value. This leads, for example, to a wrong plot, because shapley$plot uses feature.value as the label, so the phi values are shown for the wrong feature. A script demonstrating the issue is attached.
Can you confirm this behaviour? I guess the workaround is to manually put the columns of both datasets in the same order before running the Shapley analysis, but I thought it would be more elegant to implement this in the function directly, as a user may not be aware of this requirement.
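For reference, the manual workaround mentioned above could look roughly like this in base R. The data frames are toy stand-ins: `X` plays the role of predictor$data$X and the values are made up.

```r
# Toy training data and a record to explain with a different column order.
X <- data.frame(age = c(30, 40), income = c(2000, 3000), height = c(170, 180))
x.interest <- data.frame(height = 175, age = 35, income = 2500)

# Fail early if the two do not contain the same set of features.
stopifnot(setequal(colnames(X), colnames(x.interest)))

# Reorder the record's columns by name to match the training data.
x.interest <- x.interest[, colnames(X), drop = FALSE]

print(x.interest)
```

Selecting columns by name keeps each value attached to its feature, so only the order changes.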