Send null values down both branches

Open • aeftimia opened this issue 3 years ago • 1 comment

WIP

A potential way of handling missing data that avoids overfitting. See feature request #8249 and this link for the basic concept.

I added a new boolean to the training parameters (better names are, of course, welcome) and took a crack at refactoring evaluate_splits.h to incorporate the algorithm when searching for splits on continuous features. The idea: instead of simply doing a backward sweep when there is a discrepancy between left_sum and the total gradient and Hessian, do a backward sweep that adds 1/(iend-ibegin) times that discrepancy to left_sum at each iteration. This way, the residual gradients and Hessians from null values are added to left_sum incrementally throughout the sweep.
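To make the incremental idea concrete, here is a minimal Python sketch. It is not the code in this PR and does not mirror evaluate_splits.h: `GradStats`, `leaf_score`, and `sweep_spreading_missing` are hypothetical names, the number of bins `n` stands in for iend-ibegin, the sweep direction is simplified to a single pass over the bins, and the gain uses the usual second-order score G^2 / (H + lambda) as an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GradStats:
    """Sum of gradients and Hessians over a set of rows (hypothetical helper)."""
    grad: float = 0.0
    hess: float = 0.0


def leaf_score(s: GradStats, reg_lambda: float) -> float:
    # Second-order leaf score G^2 / (H + lambda); assumed here for illustration.
    return s.grad * s.grad / (s.hess + reg_lambda)


def sweep_spreading_missing(
    bins: List[GradStats], total: GradStats, reg_lambda: float = 1.0
) -> List[Tuple[int, GradStats, float]]:
    """Sweep over histogram bins, adding 1/n of the null-value residual per step.

    Returns one (bin index, left_sum, gain) tuple per candidate split.
    """
    n = len(bins)
    # Residual: the part of the total not covered by any bin, i.e. the
    # gradient statistics of rows whose feature value is null.
    miss = GradStats(
        total.grad - sum(b.grad for b in bins),
        total.hess - sum(b.hess for b in bins),
    )
    left = GradStats()
    steps = []
    for i, b in enumerate(bins):
        # Each step adds the bin itself plus an equal share of the residual.
        left = GradStats(
            left.grad + b.grad + miss.grad / n,
            left.hess + b.hess + miss.hess / n,
        )
        right = GradStats(total.grad - left.grad, total.hess - left.hess)
        gain = (
            leaf_score(left, reg_lambda)
            + leaf_score(right, reg_lambda)
            - leaf_score(total, reg_lambda)
        )
        steps.append((i, left, gain))
    return steps


# Toy data: the bins sum to (0.0, 4.0), so the residual from nulls is (4.0, 2.0).
bins = [GradStats(1.0, 1.0), GradStats(2.0, 1.0), GradStats(-1.0, 1.0), GradStats(-2.0, 1.0)]
total = GradStats(4.0, 6.0)
for i, left, gain in sweep_spreading_missing(bins, total):
    print(i, left, round(gain, 4))
```

In this sketch the null-value statistics are never assigned wholesale to one side: after k of n steps, left_sum holds k/n of the residual and the right side holds the rest.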

If the refactoring approach seems acceptable, I can repeat the pattern in the methods that handle categorical and one-hot encoded features, then move on to inference and testing.

@trivialfis

aeftimia, Sep 21 '22 20:09

Thank you for working on this. I will look deeper into the PR.

trivialfis, Sep 21 '22 20:09