
Generate importances for random permutations of labels

piotrszul opened this issue 7 years ago • 10 comments

Add functionality to generate variable importances for a selected number of random permutations of the labels.
Output the variable importance for each permutation.

piotrszul avatar Jun 26 '18 01:06 piotrszul

Suggested specification:

  • add a new command-line command null-importance with parameters similar to importance with regard to input and random forest specification
  • an additional parameter to specify the number of permutations
  • add an option to output all variables (without a top limit)
  • as the primary output, produce a CSV file with variables in rows and importances from the permutations in columns (see the example below)
  • use 'NA' if a variable was not selected in the n-th permutation (either it was not important or it was not in the top k important variables)
  • since computing the permutations can be a long-running process, consider saving intermediate results and supporting continuation of a run (with the same random forest parameters but a possibly different number of permutations; only permutations that do not yet exist should be computed)
  • provide options to save the models and the permuted labels in the intermediate output

Example output:

,perm_0,perm_1,perm_2,...
var_1,0.0332,0.03232,0.003232,...
var_3,NA,0.23232,0.093232,...
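
For illustration, here is a minimal sketch of the proposed loop in Python, using scikit-learn's RandomForestClassifier as a stand-in for VariantSpark's random forest; the function name and its parameters (n_permutations, top_k) are hypothetical, not existing options:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def null_importances(X, y, n_permutations=100, top_k=None, seed=0):
    """Fit one forest per permuted label vector and collect importances.
    X is a pandas DataFrame of variables, y the label vector."""
    rng = np.random.default_rng(seed)
    columns = {}
    for i in range(n_permutations):
        y_perm = rng.permutation(y)  # shuffle labels, leave features intact
        rf = RandomForestClassifier(n_estimators=500, random_state=i)
        rf.fit(X, y_perm)
        imp = pd.Series(rf.feature_importances_, index=X.columns)
        if top_k is not None:
            imp = imp.nlargest(top_k)  # keep only the top-k, as discussed
        columns[f"perm_{i}"] = imp
    # variables in rows, permutations in columns; missing cells become NaN
    return pd.DataFrame(columns)

# null_importances(X, y, n_permutations=3).to_csv("null_imp.csv", na_rep="NA")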

piotrszul avatar Jun 26 '18 02:06 piotrszul

Seems good in general.

  • I think this should address the issue of the number of importance scores per feature (as I discussed with you on Slack). Therefore, I think specifying the number of permutations is very important indeed.

  • Getting all variables might not be feasible, especially when we want enough permutations to get multiple scores for every variable, so that option is not so necessary (at least not for me). On the other hand, specifying a maximum number of variables could be good: in that case we would run the random forest once with the original data, select the top X variables, and run the rest of the permutations with those variables only. This way we save computation while still getting enough scores per variable.

  • the output format seems perfect to me

amnonbleich avatar Jun 29 '18 02:06 amnonbleich

By 'getting all variables' I meant not so much getting values for all of them (as that may not be possible, as you mentioned) but rather not limiting the number, so that we output everything available. Currently (I think) the importance command requires you to specify a mandatory top.

But perhaps we should also add an option to subset the variables to be included in the permutations (e.g. using the output from a run on the actual labels).

piotrszul avatar Jun 29 '18 03:06 piotrszul

Oh, so yes, it shouldn't be mandatory to select top; I agree.

And I agree about the option of selecting the number of variables to include in the permutations too. In this case we should compute importances from one run of variantSpark with the original labels, and then continue the permutation importance with a subset of the X most important variables.

amnonbleich avatar Jul 02 '18 05:07 amnonbleich

OK. Once the method crystallizes we may make it more integrated, but for now I suggest that we have two commands: importance and null-importance.

In both of them we will have options to:

  • limit or not limit the number of top variables
  • provide a file with the (subset of) variables to use (the output from the importance command can be used here). I think it may be necessary to re-run importance with the selected variables, as the importances on the subset may differ from those on all variables.

So the procedure for now will be (a rough sketch in code follows the list):

  1. run importance on all variables (this should probably use many trees and a low mtry so as not to exclude potentially significant variables); select the top N important ones
  2. run importance on the important subset
  3. run null-importance on the important subset
  4. run the analysis: p-value estimation etc., possibly further narrowing the variable set; repeat 2, 3, 4
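
Here run_importance, run_null_importance and empirical_pvalues are hypothetical wrappers around the two proposed commands and the p-value analysis; the top_n and alpha cut-offs are only illustrative:

import pandas as pd
from typing import Callable

def narrow_variables(run_importance: Callable, run_null_importance: Callable,
                     empirical_pvalues: Callable, top_n: int = 1000,
                     alpha: float = 0.05, rounds: int = 3) -> pd.Series:
    # step 1: importance on all variables (many trees, low mtry),
    # then select the top_n most important ones
    subset = run_importance(None).nlargest(top_n).index.tolist()
    pvals = pd.Series(dtype=float)
    for _ in range(rounds):
        imp = run_importance(subset)          # step 2: importance on the subset
        null = run_null_importance(subset)    # step 3: null-importance on the subset
        pvals = empirical_pvalues(imp, null)  # step 4: p-values (sketched further below)
        subset = pvals[pvals < alpha].index.tolist()  # narrow the set and repeat
    return pvals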

Finally, there is a question of how the importances are normalized (or, on the flip side, what normalization you need for this method to work). Currently I think they are normalized to add up to 100.0, but this may not be appropriate here.

piotrszul avatar Jul 03 '18 00:07 piotrszul

Sounds good. The option for giving a file with the variables to use is perfect.

Regarding the p-values and the normalization of the importances: the p-values are computed from the null distribution of the importance of each variable, so we need many importance scores for each variable (to fit the distribution), i.e. many null-importance iterations. If done in this way, we don't need to normalize the scores, as the final outputs are all p-values, which are on the scale of 0-1.
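
For concreteness, one common way to do this step is an empirical tail count against each variable's null distribution; a sketch assuming the variables-by-permutations CSV proposed above (all file names here are illustrative):

import pandas as pd

def empirical_pvalues(observed: pd.Series, null_df: pd.DataFrame) -> pd.Series:
    """Per-variable empirical p-value: the fraction of null importances at
    least as large as the observed one, with a +1 pseudo-count so that the
    p-value is never exactly zero."""
    null_df = null_df.reindex(observed.index)          # align on variable names
    exceed = null_df.ge(observed, axis=0).sum(axis=1)  # 'NA' cells count as False
    n = null_df.notna().sum(axis=1)                    # permutations with a score
    return (exceed + 1) / (n + 1)

# observed = pd.read_csv("importance.csv", index_col=0).iloc[:, 0]
# null_df = pd.read_csv("null_imp.csv", index_col=0)  # the perm_0, perm_1, ... table
# pvals = empirical_pvalues(observed, null_df)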

And for computing the tail of the null distributions, I indeed thought of recursively sub-selecting the top X variables from the importance run on the original labels, and making more and more permutations the fewer variables we have. But anyway, we can discuss it after the basic version is ready.

amnonbleich avatar Jul 03 '18 01:07 amnonbleich

The normalization question is not about the p-values but about the importances you use to compute them. As I said, currently all the importances are normalized in such a way that they add up to 100.0. This will be true both for the null importances (for each permutation) and for the actual importances. Is that OK, or do you need something else (maybe the raw importance scores)?
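
In code terms the distinction is just this (a sketch, not VariantSpark's actual implementation):

import pandas as pd

raw = pd.Series({"var_1": 0.12, "var_2": 0.03, "var_3": 0.45})  # raw scores
normalized = raw * 100.0 / raw.sum()  # current behaviour: scores sum to 100.0

Note that under this scheme each permutation's scores would be rescaled by their own total, so normalized null scores sit on a per-run scale; that may be part of why the normalization question matters for the p-value method.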

piotrszul avatar Jul 03 '18 01:07 piotrszul

I don't understand why we need to apply normalization to the importances...

amnonbleich avatar Jul 03 '18 01:07 amnonbleich

That's actually a very good question.

I do not think that we need it, but we certainly do it (in the current code); I do not remember what the motivation for this was (maybe to get results comparable to some other implementation, I will need to dig it out).

Anyway, I think that you may need the raw importance scores (what do you think?), and if so I will add an option to report them.

piotrszul avatar Jul 03 '18 02:07 piotrszul

You mean the non-normalized ones? If yes, well... I don't see a reason at the moment to need those, but it's always a good idea to keep the raw results. Thanks!

amnonbleich avatar Jul 04 '18 06:07 amnonbleich