Strange behavior on covtype dataset
Hey all,
While fiddling around with TF-DF, we found strange behavior of the GradientBoostedTrees model on sklearn's covtype dataset.
Setup
Environment
TFDF was run on Google Colab and Multipass.
Multipass:
- Ubuntu 20.04.3 LTS
- 14.4G Disk
- 3.8G Memory
Hyperparameters
The GradientBoostedTrees model was run with the following parameters:
dt_kwargs_base = {
    'num_trees': 100,
    'growing_strategy': "BEST_FIRST_GLOBAL",
    'max_depth': 6,
    'use_hessian_gain': True,
    'sorting_strategy': "IN_NODE",
    'shrinkage': 1.,
    'subsample': 1.,
    'sampling_method': 'RANDOM',
    'l1_regularization': 1.,
    'l2_regularization': 1.,
    'l2_categorical_regularization': 1.,
    'num_candidate_attributes': -1,
    'num_candidate_attributes_ratio': -1.,
    'min_examples': 1,
    'validation_ratio': 0.,
    'early_stopping': "NONE",
    'in_split_min_examples_check': False,
    'max_num_nodes': -1,
    'verbose': 0,
}
Dataset
sklearn covtype dataset
- Classes: 7
- Samples total: 581012
- Dimensionality: 54
- Features: int
Split with sklearn's train_test_split with test_size=0.2 and random_state=42. We decremented y by 1 so that the labels fall in the expected range 0–6.
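For reference, the data preparation can be sketched like this. To keep the snippet self-contained, it uses a small synthetic stand-in with the same shape conventions as covtype (54 integer features, labels 1–7) instead of the actual sklearn.datasets.fetch_covtype download:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fetch_covtype(return_X_y=True):
# 54 integer features, labels in 1..7.
rng = np.random.default_rng(42)
X = rng.integers(0, 100, size=(1000, 54))
y = rng.integers(1, 8, size=1000)

# Shift labels from 1..7 down to 0..6, as described above.
y = y - 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y.min(), y.max())
```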
Result of predictions on the test set
After training completed without issue, we got these predictions on the test set:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
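Every entry of the prediction matrix is NaN. A quick check like the following (shown here on a stand-in array of the same shape, since it does not depend on the trained model) confirms that no row contains a valid probability:

```python
import numpy as np

# Stand-in for model.predict(X_test): an all-NaN
# (n_samples, n_classes) array like the one shown above.
preds = np.full((5, 7), np.nan)

# Sanity checks on the prediction matrix.
all_nan = np.isnan(preds).all()
any_nan = np.isnan(preds).any()
print(all_nan, any_nan)
```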
The trees produced did not look suspicious.
This did not happen when we:
- Lowered num_trees (e.g. to 50)
- Lowered the tree depth (e.g. to 3)
- Set use_hessian_gain to False
- Lowered the shrinkage (e.g. to 0.5 or smaller)
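Each workaround is a single-key override of the base kwargs, which can be expressed as dict merges (sketch; dt_kwargs_base is abbreviated here to the keys that change):

```python
# Abbreviated base config; see the full dict above.
dt_kwargs_base = {
    'num_trees': 100,
    'max_depth': 6,
    'use_hessian_gain': True,
    'shrinkage': 1.,
}

# Each variant overrides one hyperparameter that made the NaNs disappear.
variants = {
    'fewer_trees':    {**dt_kwargs_base, 'num_trees': 50},
    'shallower':      {**dt_kwargs_base, 'max_depth': 3},
    'no_hessian':     {**dt_kwargs_base, 'use_hessian_gain': False},
    'less_shrinkage': {**dt_kwargs_base, 'shrinkage': 0.5},
}
print(sorted(variants))
```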
So far we have not looked into it further. Do you have an idea why this happens?
Thanks a lot in advance
Best Regards
Timo
Hi,
It is most likely a numerical accumulation problem. Thanks for the report, I'll take a look at it.
In the meantime, some remarks regarding the hyperparameters:
- sorting_strategy: While this has no impact on the quality of the final model, it will likely make training slower. Unless there is a good reason, it is generally best to leave this parameter at its default.
- shrinkage: A shrinkage of 1 is a bit unusual. While possible, it will likely make the model's accuracy relatively poor.
- {l1,l2}_regularization: While those values make sense, the defaults are generally better for a first run.
- max_num_nodes: Again, while possible, -1 (unlimited) is a strange value for the BEST_FIRST_GLOBAL growing strategy, which is normally constrained by the number of nodes rather than by depth.