feature_engine

DecisionTreeDiscretiser to output integers in addition to the predictions

Open solegalli opened this issue 3 years ago • 5 comments

At the moment, the DecisionTreeDiscretiser replaces the original variable values with the predictions of the tree.

I would like to add the option to return integers from 1 to k (or 0 to k-1), where k is the number of final leaves. The integers should increase with the mean target value per leaf.

I'm not sure how difficult this is to implement. I think we need to navigate the tree, pick up the final values at the leaves, and then create a mapping from each final value to an integer. We would also add a parameter to the init where the user can specify whether they want integers or predictions as output.
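A minimal sketch of that mapping, using a plain scikit-learn `DecisionTreeRegressor` on synthetic data rather than the discretiser itself. The variable names (`leaf_to_bin`, `bins`) are hypothetical and not part of feature_engine; the sketch only assumes that leaves can be identified through `tree_.children_left == -1` and that `tree_.value` holds the mean target of each node for a single-output regressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Leaves are the nodes without children (children_left == -1)
leaf_ids = np.flatnonzero(tree.tree_.children_left == -1)
# For a single-output regressor, tree_.value holds each node's mean target
leaf_means = tree.tree_.value[leaf_ids].ravel()

# Rank the leaves by mean target: lowest mean -> 0, highest mean -> k-1
order = np.argsort(leaf_means)
leaf_to_bin = {int(leaf_ids[pos]): rank for rank, pos in enumerate(order)}

# tree.apply returns the id of the leaf each sample lands in
bins = np.array([leaf_to_bin[int(leaf)] for leaf in tree.apply(X)])
```

Mapping from leaf id rather than from the predicted float avoids comparing floating-point values, and it still yields k distinct integers if two leaves happen to share the same mean.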

The addition is suggested after the Self-Guided via CART method available in MINITAB and described here.

solegalli • May 10 '22 11:05

Really nice one.

I was also using the DecisionTreeDiscretiser to bin a model's predicted probabilities, in order to define rules of action depending on the generated bin.

In that case, it would be nice to have an option such as return_boundaries among the parameters of DecisionTreeDiscretiser.


joaopcnogueira • Jul 08 '22 22:07

Hi @joaopcnogueira

Thanks for the detail and for the notebook you contributed to the examples repo. And apologies for the delay. I was on holiday.

Would you like to have a go at making the discretiser return the interval boundaries?

These links may help:

https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py

https://mljar.com/blog/extract-rules-decision-tree/

https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree
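As a rough sketch of what extracting boundaries could look like for a tree fitted on a single variable (the discretiser's one-variable case): the split thresholds stored in `tree_.threshold` are sorted and padded with infinite limits. The data is synthetic and the names `boundaries` and `interval_index` are hypothetical. The sketch relies on scikit-learn sending samples with `x <= threshold` to the left child, so the resulting intervals are closed on the right.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic single-variable data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Internal (split) nodes have children; leaves have children_left == -1
is_split = tree.tree_.children_left != -1
thresholds = np.sort(tree.tree_.threshold[is_split])

# Interval limits: (-inf, t1], (t1, t2], ..., (tn, inf)
boundaries = np.concatenate(([-np.inf], thresholds, [np.inf]))

# Index of the interval each sample falls in; side="left" keeps
# values equal to a threshold in the lower (right-closed) interval
interval_index = np.searchsorted(thresholds, X[:, 0], side="left")
```

With more than one input feature the thresholds no longer describe intervals of a single variable, so this only applies to one-variable trees.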

solegalli • Aug 03 '22 07:08

No need to apologize, I completely understand that.

I will try to make the discretiser return the interval boundaries, most likely as shown in the notebook in the examples repo.

joaopcnogueira • Aug 05 '22 15:08

If the decision tree has its random seed fixed, it will always return the same values, so we can use something like stats.rankdata(x, method='dense') to transform the floats to integers.
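A small illustration of the `rankdata` idea, with made-up prediction values standing in for the output of the tree:

```python
import numpy as np
from scipy import stats

# Hypothetical tree predictions: one float per sample, repeated per leaf
predictions = np.array([3.2, 7.9, 3.2, 12.4, 7.9, 1.5])

# Dense ranking gives equal values the same rank, with no gaps;
# subtracting 1 makes the codes run from 0 to k-1
codes = stats.rankdata(predictions, method="dense").astype(int) - 1
# codes -> [1, 2, 1, 3, 2, 0]
```

One caveat: the ranks are computed from the values present in the array, so if a batch passed at transform time does not hit every leaf, the integers would shift. Storing a value-to-integer mapping at fit time would avoid that.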

glevv • Aug 07 '22 03:08

I posted this question on Stack Overflow to find out how to create the boundaries from trees: https://stackoverflow.com/questions/75663472/how-to-obtain-the-interval-limits-from-a-decision-tree-with-scikit-learn

solegalli • Mar 07 '23 14:03