
Implement binning of `X` as an optional preprocessing step for trees

Open · adam2392 opened this issue 2 years ago • 12 comments

First, let us add binning capabilities to all forest methods:

  • extratrees
  • randomforest

And as a result, BaseDecisionTree and DecisionTreeClassifier will have the necessary sklearn code, so we can also use binning in oblique trees (including MORF trees) and unsupervised trees.
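
For illustration, here is a minimal sketch of what binning as a preprocessing step could look like at the Python level. scikit-learn's KBinsDiscretizer serves as a stand-in binner; the binner choice and n_bins=255 are assumptions for this sketch, not a committed design:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Quantile-bin every feature into at most 255 ordinal codes before the
# forest ever sees X; the tree code itself is unchanged.
binned_forest = make_pipeline(
    KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
binned_forest.fit(X, y)
```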

Reference discussion on preliminary work done: https://github.com/neurodata/scikit-learn/issues/23

adam2392 · Feb 06 '23 02:02

Downstream it will be useful to verify that:

  1. all unit tests pass
  2. compare w/ and w/o binning in terms of accuracy/ROC-AUC/precision versus fit time and predict time. This could be a set of plots with the x-axis being runtime (fit or predict) and the y-axis being the scoring metric. We should replicate this for each type of tree across a range of n_samples and n_features (see the sketch below).
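
A rough sketch of that benchmark loop, with plain scikit-learn forests standing in for the scikit-tree estimators (the sample/feature grid, n_bins=64, and the metric choice are all illustrative):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

for n_samples, n_features in [(1_000, 50), (10_000, 50), (10_000, 500)]:
    X, y = make_classification(n_samples=n_samples, n_features=n_features, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    estimators = {
        "no binning": RandomForestClassifier(n_estimators=100, random_state=0),
        "binned": make_pipeline(
            KBinsDiscretizer(n_bins=64, encode="ordinal", strategy="quantile"),
            RandomForestClassifier(n_estimators=100, random_state=0),
        ),
    }
    for name, est in estimators.items():
        t0 = time.perf_counter()
        est.fit(X_tr, y_tr)
        fit_time = time.perf_counter() - t0
        t0 = time.perf_counter()
        proba = est.predict_proba(X_te)[:, 1]
        predict_time = time.perf_counter() - t0
        print(f"{name:>10} n={n_samples} p={n_features} "
              f"fit={fit_time:.2f}s predict={predict_time:.2f}s "
              f"auc={roc_auc_score(y_te, proba):.3f}")
```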

This could become a solid computing-conference paper, or a low-hanging-fruit software paper on improving decision trees.

adam2392 · Feb 06 '23 02:02

Relevant paper: https://arxiv.org/pdf/1609.06119.pdf

adam2392 · Feb 06 '23 02:02

See related discussion in scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/5212

adam2392 · Feb 06 '23 02:02

The lab has received a request to build this binning tree in order to analyze thousands of medical datasets and test numerous hypotheses. @jshinm will start working on this.

jshinm · Mar 11 '23 00:03

This has been completed naively in upstream scikit-learn-tree (i.e., the fork of the scikit-learn repo).

adam2392 · Apr 26 '23 01:04

@adam2392 what is the deal with binning in scikit-tree? can we bin? if not, can we please at least add an issue to bin?

jovo · Sep 26 '23 15:09

We can already bin naively for all forests; that is, we bin at the Python API level. Whether this actually improves matters is a separate experimental question. We need someone to run a side-by-side experiment w/ and w/o binning for increasing feature dimension and sample size.

To fully enable this feature, we would need to figure out a design for adding it into the Cython code cleanly. That is, someone must implement the logic to check n_bins candidate splits, rather than naively checking all n_samples splits. From a quick glance at the code as it stands, I don't see a simple way of adding this; a rough sketch of the idea is below.
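
To make the design question concrete, here is a rough Python sketch of what the split search would look like once features arrive as pre-binned integers. This is not the Cython code and is deliberately simplified (single feature, binary Gini, no sample weights); the point is that the candidate scan is over n_bins, not n_samples:

```python
import numpy as np

def best_binned_split(x_binned, y, n_bins):
    """Best threshold for one pre-binned feature in O(n + n_bins).

    x_binned: integer bin codes in [0, n_bins); y: binary labels (0/1).
    A split at bin b sends samples with code <= b to the left child.
    """
    n = y.shape[0]
    # One O(n) pass over the samples: per-bin class counts, no sorting.
    hist = np.zeros((n_bins, 2))
    np.add.at(hist, (x_binned, y), 1)

    total = hist.sum(axis=0)
    left = np.zeros(2)
    best_bin, best_gini = -1, np.inf
    # One O(n_bins) pass over the bins: evaluate every candidate threshold.
    for b in range(n_bins - 1):
        left += hist[b]
        n_left = left.sum()
        if n_left == 0 or n_left == n:
            continue
        right = total - left
        gini_left = 1.0 - ((left / n_left) ** 2).sum()
        gini_right = 1.0 - ((right / (n - n_left)) ** 2).sum()
        gini = (n_left * gini_left + (n - n_left) * gini_right) / n
        if gini < best_gini:
            best_bin, best_gini = b, gini
    return best_bin, best_gini

rng = np.random.default_rng(0)
x = rng.integers(0, 32, size=1_000)
y = (x > 20).astype(int)
print(best_binned_split(x, y, n_bins=32))  # expect best_bin == 20, gini == 0.0
```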

I am happy to code review for someone who wants to add this though and do the experiments.

adam2392 · Sep 26 '23 15:09

I don't understand. Are you saying one could naively implement binning per node in Python, and that the code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether that is any faster.

jovo · Sep 26 '23 15:09

> I don't understand. Are you saying one could naively implement binning per node in Python, and that the code would be easy to write? Can you show us the code for how to do that? As a first step, we could simply test whether that is any faster.

Yes, that is correct. I implemented this in the scikit-learn fork a few months ago: https://github.com/neurodata/scikit-learn/blob/679c9a2ef7a560424781f2a210889e21c7125734/sklearn/ensemble/_forest.py#L533-L563

This is available in all scikit-tree forests as a result.

My point re: the Cython code is that this Python-level hack is potentially not enough; it might not even help, because the Cython code is still naively sorting/searching all n_samples candidate splits per feature, rather than n_bins. We can get even more by going into the Cython code.
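
Roughly the shape of that Python-level hack, as a hedged sketch rather than the fork's actual code: compute per-feature quantile thresholds inside fit, digitize X before handing it to the unchanged tree code, and reuse the same thresholds at predict time:

```python
import numpy as np

def fit_bin_edges(X, n_bins=255):
    """Per-feature quantile thresholds computed on the training data."""
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    return [np.unique(np.quantile(X[:, j], quantiles)) for j in range(X.shape[1])]

def bin_features(X, edges):
    """Map each feature to small integer codes using the stored thresholds."""
    return np.column_stack(
        [np.searchsorted(edges[j], X[:, j]) for j in range(X.shape[1])]
    ).astype(np.uint8)

# Inside fit: bin X_train and pass the codes to the unchanged tree code.
# Inside predict: reuse the *training* edges so the codes mean the same thing.
X_train = np.random.default_rng(0).normal(size=(1_000, 5))
edges = fit_bin_edges(X_train, n_bins=255)
X_train_binned = bin_features(X_train, edges)
```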

adam2392 · Sep 26 '23 15:09

@adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of too? Also, has scikit-learn finalized its inclusion of missing-data and categorical support for decision trees?

@sampan501 You could test this hypothesis on your data and see?

jovo · Sep 26 '23 15:09

> @adam2392 So you hypothesize, without strong empirical results, that this is too slow, for some definition of too?

Yes, because the Cython code remains the same, so it would be very surprising if there were a significant improvement. I could see the sorting being faster, or the for-loop within Cython being somewhat faster due to more repeated feature values (because they've been binned), but the expensive for-loop itself remains the same.

Just to be clear: I am 100% sure we would go from O(n log(n)) to O(n) if we modify the Cython code. But there is no reason to think we can achieve that speedup with the Python-level code I added alone.
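
For intuition on where that asymptotic win would come from, a toy comparison on synthetic arrays (not a benchmark of the actual tree code): the current splitter has to sort each node's feature values, which is O(n log n), while a counting pass over pre-binned codes is O(n):

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n, n_bins = 1_000_000, 255
x = rng.normal(size=n)                      # raw feature values
x_binned = rng.integers(0, n_bins, size=n)  # pre-binned integer codes

# Sort-based split search: O(n log n) per feature per node.
print(timeit.timeit(lambda: np.sort(x), number=10))
# Histogram-based split search: O(n) counting pass plus an O(n_bins) scan.
print(timeit.timeit(lambda: np.bincount(x_binned, minlength=n_bins), number=10))
```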

> Also, has scikit-learn finalized its inclusion of missing-data and categorical support for decision trees?

Re: missing data, no, but it seems to be in progress. Categorical support, I believe, is not even being considered at this point, unfortunately :/

adam2392 · Sep 26 '23 15:09

@sampan501 might be worth trying on the linear dataset for the power curves, and also might not be. I'd just try it on the 1024 sample size, just the RF part (not the power), and see if it is any faster.

jovo · Sep 27 '23 02:09