VariantSpark Add support for continuous variables

Add support for continuous variables for trees.

Aug 29 '17 03:08 piotrszul

The optimal solution would be, for each variable, to sort the samples by the value of that variable and consider every possible split (between any two adjacent samples in the sorted list). This is computationally intensive: it requires checking S-1 splits if there are S samples.
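To make the cost concrete, here is a minimal Python sketch of the exhaustive search for one variable. It is only an illustration, not VariantSpark code: the function names are hypothetical, and Gini impurity is an assumed split criterion (this thread does not specify one).

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a label histogram (assumed criterion, not from the thread)."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_exhaustive_split(values, labels):
    """Scan all S-1 gaps between adjacent sorted samples of one variable.

    Returns (threshold, weighted_impurity); threshold is None if every
    value is identical and no split exists.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    left = Counter()          # labels of samples at or below the candidate split
    right = Counter(labels)   # labels of samples above it
    best = (None, float("inf"))
    for pos in range(n - 1):
        i = order[pos]
        left[labels[i]] += 1
        right[labels[i]] -= 1
        lo, hi = values[i], values[order[pos + 1]]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        n_left = pos + 1
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2.0, score)
    return best


threshold, score = best_exhaustive_split(
    [0.1, 0.4, 0.35, 0.8, 0.9, 0.7], [0, 0, 0, 1, 1, 1]
)
```

The sort dominates at O(S log S) per variable, and the scan itself visits every one of the S-1 gaps, which is the cost the binning ideas below try to avoid.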

The optimised, approximate solution is to bin the sorted samples and consider split points only between bins: B-1 splits if there are B bins. The question is how to group the samples into bins.

I believe the best way to bin samples is to run a clustering algorithm on all samples (for each variable). In practice, however, this might be computationally intensive, especially if the number of variables is very high.

The alternative approach is to use either fixed-length bins (equal ranges of the variable's value) or fixed-size bins (equal numbers of samples).

Fixed-length bins are really inefficient, as some bins might have too many samples and others might be empty. Empty bins can be eliminated, but there may still be many bins with only a few samples.
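A small Python sketch of the fixed-length problem, under the assumption that the value range is simply divided into equal intervals (the function name is hypothetical):

```python
def fixed_length_bin_counts(values, num_bins):
    """Count samples per bin when [min, max] is cut into equal-length intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # Clamp so the maximum value falls into the last bin.
            idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# One outlier stretches the range: five samples share a bin, three bins are empty.
counts = fixed_length_bin_counts([0.0, 0.1, 0.2, 0.3, 0.4, 10.0], 5)
# counts == [5, 0, 0, 0, 1]
```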

Fixed-size bins seem more logical. There would be more bins where samples are concentrated and fewer bins in other regions. The problem is with bins that fall between concentration points: they may include samples with large differences in value. Let N be the size of a bin (the number of samples in each bin), so N=S/B. If N is small enough, the negative effect on accuracy will be reduced. However, a very low N leads to a high number of bins B, which results in more split points and more computation. Finding an optimal N would be a problem.
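A minimal Python sketch of fixed-size binning and the candidate split points it produces. The function names are hypothetical, and placing the threshold at the midpoint between neighbouring bins is an assumption, not something decided in this thread:

```python
def fixed_size_bins(values, bin_size):
    """Sort the samples and cut them into consecutive bins of bin_size samples.

    The last bin may be smaller when S is not a multiple of bin_size.
    """
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def split_points(bins):
    """One candidate threshold between each pair of neighbouring bins."""
    return [(a[-1] + b[0]) / 2.0 for a, b in zip(bins, bins[1:])]


bins = fixed_size_bins([10, 1, 50, 2, 11, 3, 51, 12], 3)
# bins == [[1, 2, 3], [10, 11, 12], [50, 51]]
candidates = split_points(bins)
# candidates == [6.5, 31.0]
```

Note that tied values straddling a bin boundary would yield a threshold equal to the tied value; a real implementation would need to handle that case.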

My suggested optimisation is to choose the largest possible N and then compute the variance of each bin (over the values of the variable for the samples in that bin). We then compute the average variance, and if the variance of a bin is significantly larger than the average, we split it into two smaller bins, each half the size of the original. This process can be repeated over multiple rounds. Without doubt, computing the variances adds extra computation. How much faster is this than the other options? What about the accuracy? These are the questions to answer.
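The refinement step could be sketched in Python as follows. This is only one reading of the idea: "significantly larger" is modelled as exceeding `factor` times the average variance, and both `factor` and `max_rounds` are hypothetical parameters whose values would need to be tuned.

```python
from statistics import mean, pvariance


def refine_bins(bins, factor=2.0, max_rounds=3):
    """Halve every bin whose variance exceeds factor * average bin variance.

    bins is a list of non-empty lists of sorted values for one variable.
    The average is recomputed each round (an assumption of this sketch).
    """
    for _ in range(max_rounds):
        variances = [pvariance(b) for b in bins]
        threshold = factor * mean(variances)
        refined = []
        changed = False
        for b, var in zip(bins, variances):
            if len(b) >= 2 and var > threshold:
                mid = len(b) // 2
                refined.extend([b[:mid], b[mid:]])  # two bins of half the size
                changed = True
            else:
                refined.append(b)
        bins = refined
        if not changed:
            break  # no bin stood out, so further rounds would do nothing
    return bins


start = [[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 20.0, 40.0, 60.0]]
one_round = refine_bins(start, factor=2.0, max_rounds=1)
# one_round == [[1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3], [3.0, 20.0], [40.0, 60.0]]
```

Because the average drops after each split, later rounds can keep splitting the same sparse region down to single samples, so the stopping rule probably matters as much as the choice of N.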

Jul 06 '18 06:07 ArashBayatDev

Done

Feb 13 '24 00:02 rocreguant