Strange behavior on covtype dataset
Hey all,
While fiddling around with TF-DF, we found strange behavior of the GradientBoostedTrees model on sklearn's covtype dataset.
Setup
Environment
TFDF was run on Google Colab and Multipass.
Multipass:
- Ubuntu 20.04.3 LTS
- 14.4G Disk
- 3.8G Memory
Hyperparameters
The GradientBoostedTrees model was run with the following parameters:
dt_kwargs_base = {
    'num_trees': 100,
    'growing_strategy': "BEST_FIRST_GLOBAL",
    'max_depth': 6,
    'use_hessian_gain': True,
    'sorting_strategy': "IN_NODE",
    'shrinkage': 1.,
    'subsample': 1.,
    'sampling_method': 'RANDOM',
    'l1_regularization': 1.,
    'l2_regularization': 1.,
    'l2_categorical_regularization': 1.,
    'num_candidate_attributes': -1,
    'num_candidate_attributes_ratio': -1.,
    'min_examples': 1,
    'validation_ratio': 0.,
    'early_stopping': "NONE",
    'in_split_min_examples_check': False,
    'max_num_nodes': -1,
    'verbose': 0,
}
Dataset
sklearn covtype dataset
- Classes: 7
- Samples total: 581012
- Dimensionality: 54
- Features: int
Split with sklearn's train_test_split with test_size=0.2 and random_state=42. We decremented y by 1 so that the labels fall in the expected range 0–6.
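For reference, the data preparation can be sketched like this. To keep the snippet self-contained, it uses a small synthetic stand-in with the same shape conventions as covtype (54 integer features, labels 1–7) instead of the actual sklearn.datasets.fetch_covtype download:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for fetch_covtype(return_X_y=True):
# 54 integer features, labels in 1..7.
rng = np.random.default_rng(42)
X = rng.integers(0, 100, size=(1000, 54))
y = rng.integers(1, 8, size=1000)

# Shift labels from 1..7 down to 0..6, as described above.
y = y - 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y.min(), y.max())
```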
Result of predictions on the test set
After training completed without issue, we got these predictions on the test set:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
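Every entry of the prediction matrix is NaN. A quick check like the following (shown here on a stand-in array of the same shape, since it does not depend on the trained model) confirms that no row contains a valid probability:

```python
import numpy as np

# Stand-in for model.predict(X_test): an all-NaN
# (n_samples, n_classes) array like the one shown above.
preds = np.full((5, 7), np.nan)

# Sanity checks on the prediction matrix.
all_nan = np.isnan(preds).all()
any_nan = np.isnan(preds).any()
print(all_nan, any_nan)
```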
The trees produced did not look suspicious.
This did not happen when we:
- Lowered num_trees (e.g. to 50)
- Lowered the tree depth (e.g. to 3)
- Set use_hessian_gain to False
- Lowered the shrinkage (e.g. to 0.5 or smaller)
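Each workaround is a single-key override of the base kwargs, which can be expressed as dict merges (sketch; dt_kwargs_base is abbreviated here to the keys that change):

```python
# Abbreviated base config; see the full dict above.
dt_kwargs_base = {
    'num_trees': 100,
    'max_depth': 6,
    'use_hessian_gain': True,
    'shrinkage': 1.,
}

# Each variant overrides one hyperparameter that made the NaNs disappear.
variants = {
    'fewer_trees':    {**dt_kwargs_base, 'num_trees': 50},
    'shallower':      {**dt_kwargs_base, 'max_depth': 3},
    'no_hessian':     {**dt_kwargs_base, 'use_hessian_gain': False},
    'less_shrinkage': {**dt_kwargs_base, 'shrinkage': 0.5},
}
print(sorted(variants))
```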
So far we have not looked into it further. Do you have an idea why this happens?
Thanks a lot in advance
Best Regards
Timo
Hi,
It is most likely a numerical accumulation problem. Thanks for the report, I'll take a look at it.
In the meantime, some remarks regarding the hyperparameters:
- sorting_strategy: While this has no impact on the quality of the final model, it will likely make training slower. Unless there is a good reason, it is generally best to leave this parameter at its default.
- shrinkage: A shrinkage of 1 is a bit unusual. While possible, it will likely make the model's accuracy relatively poor.
- {l1,l2}_regularization: While those values make sense, the defaults are generally better for a first run.
- max_num_nodes: Again, while possible, -1 (unlimited) is a strange value for the BEST_FIRST_GLOBAL growing strategy, which is normally constrained by the number of nodes rather than by depth.