[bug] fix MT2203 RNG non-uniformity and random bin indices in decision forest training

Open icfaust opened this issue 2 years ago • 8 comments

Description

Each MT2203 RNG engine is individually uniform when sampled. However, when the first values drawn from two or more engines are aggregated, they are not uniformly distributed. The decision forest algorithm gives each tree its own RNG engine and relies on randomness between trees, so this PR burns initial RNG values up to the point where the engine collection is empirically uniform, at the cost of an imperceptible performance loss.
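A minimal sketch of the burn-in idea, for illustration only. It assumes Python's `random.Random` as a stand-in (the standard library has no MT2203), and both `make_engine_family` and the `base_seed + i` seeding are hypothetical; only the burn count of 400 comes from this PR.

```python
import random

BURN_IN = 400  # samples discarded per engine, the value proposed in this PR


def make_engine_family(n_engines, base_seed, burn_in=BURN_IN):
    """Create one generator per tree and discard its first `burn_in` draws.

    random.Random is only a stand-in here, not MT2203.
    """
    engines = []
    for i in range(n_engines):
        engine = random.Random(base_seed + i)  # hypothetical per-engine seeding
        for _ in range(burn_in):
            engine.random()  # burn: draw a value and throw it away
        engines.append(engine)
    return engines
```

The cost is a fixed `burn_in` draws per engine at setup time, independent of the data size, which is consistent with the loss being described as imperceptible.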

The second issue concerns the binary search used to find a split for ExtraTrees (regressor and classifier). In its current orientation, the search failed to find the largest bin left edge, so it has been changed to always guarantee a valid split. The change stems from the ambiguity of combining a binning approach with the Extra Trees algorithm definition. All uses of the .min parameter are removed, so it is dropped entirely from IndexedFeatures and the initial binning scripts.
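The search described above can be sketched roughly as follows. This is not the code from genRandomBinIdx: the function names are hypothetical, and "valid split" is assumed here to mean that at least one bin ends up on each side.

```python
import random


def largest_left_edge_index(left_edges, value):
    """Return the largest i with left_edges[i] <= value, or -1 if there is none.

    `left_edges` must be sorted in ascending order.
    """
    lo, hi = 0, len(left_edges)
    while lo < hi:
        mid = (lo + hi) // 2
        if left_edges[mid] <= value:
            lo = mid + 1  # edge qualifies, look for a larger one to the right
        else:
            hi = mid  # edge is too large, discard it and everything after it
    return lo - 1


def random_bin_index(left_edges, rng):
    """Draw a threshold between the first and last left edge and map it to a bin."""
    if len(left_edges) < 2:
        raise ValueError("need at least two bins to split")
    value = rng.uniform(left_edges[0], left_edges[-1])
    idx = largest_left_edge_index(left_edges, value)
    # Assumption: clamp so that at least one bin remains on each side.
    return min(max(idx, 0), len(left_edges) - 2)
```

For example, `random_bin_index([0.0, 1.0, 2.5, 4.0], random.Random(0))` always returns an index between 0 and 2, so the last bin is never left without a right-hand side.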

This will fix the following deselected_tests from sklearnex:
tests/test_multioutput.py::test_classifier_chain_tuple_order
ensemble/tests/test_forest.py::test_distribution

However, this changes the determinism of the trees used in the sklearnex tests, so some tests that previously passed by chance could now fail.

This non-uniformity negatively impacts both random forest, in the bootstrapping process, and extra trees, in the initially chosen splits.
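One way to look at the aggregated first draws is to bin the first value from each engine and compare the counts against a flat expectation. The sketch below is an assumption about how such a check could be done, not the analysis used for this PR; the helper names are hypothetical, and any object with a `random()` method returning values in [0, 1) works as an engine.

```python
import random


def first_draw_histogram(engines, n_bins=10):
    """Bin the next draw of every engine into `n_bins` equal-width bins."""
    counts = [0] * n_bins
    for engine in engines:
        u = engine.random()  # expected to lie in [0, 1)
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return counts


def chi_square_statistic(counts):
    """Pearson chi-square statistic against equal expected counts per bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)
```

A statistic far above the number of bins suggests that the aggregated draws are not uniform; repeating the check after discarding a fixed number of values per engine shows whether a given burn count is enough.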

Changes proposed in this pull request:

Check for a family engine (only MT2203)
Burn a magic number of samples for every engine (400)
Remove .min() from IndexedFeatures
Change binary search in genRandomBinIdx for classification and regression

Nov 21 '23 12:11 icfaust

/intelci: run

Nov 21 '23 13:11 icfaust

special private CI run with mentioned tests enabled: http://intel-ci.intel.com/ee886edb-4adb-f114-a7aa-a4bf010d0e2e

Nov 21 '23 13:11 icfaust

Performance comparisons are available upon request

Nov 21 '23 13:11 icfaust

The test_distribution for ExtraTreesRegressor is not fixed. This requires further analysis.

Nov 21 '23 14:11 icfaust

special private CI run with mentioned tests enabled: http://intel-ci.intel.com/ee891bc7-4b31-f12d-a430-a4bf010d0e2e

Nov 22 '23 10:11 icfaust

/intelci: run

Nov 22 '23 10:11 icfaust

Private CI shows this now causes an issue with ExtraTreesClassification; this needs investigation.

Nov 22 '23 10:11 icfaust

/intelci: run ml_benchmarks

Jan 19 '24 16:01 icfaust