Update Shapley.R

Open TomasZdrazil opened this issue 5 years ago • 4 comments

When y.hat.diff$feature.value is created, the code takes the colnames of x.interest and simply appends them as another column. However, the order of the colnames of x.interest may differ from the order of the same features in y.hat.diff$feature, so a lookup by feature name is needed instead of a positional append. To do this, the patch creates auxiliaryTab, which holds the feature names taken from x.interest, and then uses the merge function to assign the correct feature.value to the corresponding feature.
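To make the difference between a positional append and a lookup by name concrete, here is a minimal base R sketch. It only borrows the names y.hat.diff, x.interest and auxiliaryTab from the description above; the toy data, the phi column and the plain numeric feature.value are assumptions for illustration and not the actual code in Shapley.R.

```r
# Hypothetical stand-ins for the objects named in this pull request.
x.interest <- data.frame(c = 30, a = 10, b = 20)  # column order differs
y.hat.diff <- data.frame(feature = c("a", "b", "c"),
                         phi = c(0.1, 0.2, 0.3),
                         stringsAsFactors = FALSE)

# Positional append: row "a" silently receives the value of column "c".
naive <- y.hat.diff
naive$feature.value <- unname(unlist(x.interest[1, ]))

# Lookup by name: build a small table keyed on the feature name and merge.
auxiliaryTab <- data.frame(feature = colnames(x.interest),
                           feature.value = unname(unlist(x.interest[1, ])),
                           stringsAsFactors = FALSE)
fixed <- merge(y.hat.diff, auxiliaryTab, by = "feature")

print(naive)
print(fixed)
```

Note that merge may reorder the rows of its result, so code that relies on the original row order would have to restore it afterwards; indexing with match on the feature name is an alternative that keeps the order.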

TomasZdrazil, Oct 23 '20 14:10

Thanks for this pull request. The tests don't pass: it seems that the Shapley values now don't add up to the difference in the test, as they should.

christophM, Oct 23 '20 14:10

Hi, thank you for your comment. I checked, and it seems that the Shapley values don't add up to the difference even by default, when running the unmodified iml_0.10.1. Might that be an issue in your package? I am not sure. The code I used for the test is attached. The R session info: R version 3.6.1 (2019-07-05), Platform: x86_64-w64-mingw32/x64 (64-bit), Running under: Windows 10 x64 (build 18362)

ShapleySumTest.txt

TomasZdrazil, Oct 27 '20 13:10

It does add up, but only in expectation, meaning that when you increase the sample.size in Shapley$new, you will get closer to the difference.
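As a rough illustration of what "in expectation" means here, the following base R sketch implements a generic permutation-sampling estimator on a toy model. It is not iml's implementation: the function f, the data X and the helper shapley_sum are all hypothetical, and n merely plays the role that sample.size plays in Shapley$new.

```r
set.seed(1)
# Hypothetical toy data and model.
X <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
f <- function(d) d$a + 2 * d$b * d$c
x <- X[1, ]                      # the instance to explain
target <- f(x) - mean(f(X))      # what the Shapley values should sum to

# Monte Carlo estimate of the sum of Shapley values from n samples.
shapley_sum <- function(n) {
  phi <- setNames(numeric(ncol(X)), names(X))
  for (i in seq_len(n)) {
    z <- X[sample(nrow(X), 1), ]   # random background instance
    perm <- sample(names(X))       # random feature order
    current <- z
    for (j in perm) {
      before <- f(current)
      current[[j]] <- x[[j]]       # switch feature j to its value in x
      phi[j] <- phi[j] + (f(current) - before) / n
    }
  }
  sum(phi)
}

# The gap to the target is noisy for small n and tends to shrink as n grows.
for (n in c(10, 100, 1000)) {
  cat(n, abs(shapley_sum(n) - target), "\n")
}
```

In this sketch each sample contributes f(x) minus the prediction for one random background instance, so the estimated sum equals the target only once the sampled background predictions average out to the mean prediction.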

The test for Shapley to add up can be found here: https://github.com/christophM/iml/blob/master/tests/testthat/test-Shapley.R

christophM, Oct 27 '20 13:10

Thanks, I will have to look into that more deeply, because for my data they do not add up and the gap is quite big: with sample.size = 3000, the actual difference is more than twice the sum of the Shapley values.

Anyway, this request aims to tackle a different issue: when the order of columns in the training data (predictor$data$X) is not the same as in the record to explain (x.interest), the result is misleading. The columns feature and feature.value in the table shapley$results then disagree, i.e. within a single row the feature named in feature is not the same as the one named in feature.value. Among other things, this produces a wrong plot: shapley$plot uses feature.value as the label, so the values of phi get visualised for the wrong feature. The attached script demonstrates this issue.

ShapleyColsOrderTest.txt

Can you confirm this behaviour? I guess the workaround is to order the columns of both datasets manually before running the Shapley value analysis, but I thought it would be more elegant to have this implemented directly in the function, as a user may not know about this requirement.
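The manual workaround could look roughly like the base R sketch below. The data frames are hypothetical stand-ins for predictor$data$X and x.interest, and the sketch assumes both contain exactly the same set of column names.

```r
# Hypothetical stand-ins for the training data and the record to explain.
X <- data.frame(a = 1:3, b = 4:6, c = 7:9)
x.interest <- data.frame(c = 9, a = 3, b = 6)   # same columns, different order

# Fail early if the two objects do not describe the same features.
stopifnot(setequal(colnames(X), colnames(x.interest)))

# Reorder the record to explain so that it follows the training data.
x.interest <- x.interest[, colnames(X), drop = FALSE]

print(colnames(x.interest))
```

Doing the reordering before the explainer is constructed keeps the two objects aligned without touching the package code.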

TomasZdrazil, Oct 27 '20 16:10