DecisionTree.jl

Add option to resample features at nodes without replacement

Open • mharradon opened this issue 3 years ago • 1 comment

Hello, thanks for the nice package.

I was working on an application where I wanted perfect prediction in a classification task, and found that I could not get it with partial_frac = 1.0, which I did not expect. After some investigation, it appears that instances are sampled with replacement when constructing forests. As a result, although N samples are drawn for each individual tree fit, they almost always include duplicates, so some of the original instances are left out. See e.g.:

https://github.com/JuliaAI/DecisionTree.jl/blob/master/src/regression/main.jl#L104

julia> rand(1:5, 5)
5-element Vector{Int64}:
 5
 5
 2
 2
 3
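
To make the effect concrete, here is a minimal sketch that compares the two sampling schemes using only the Random standard library. It is an illustration, not DecisionTree.jl's code: the names n, with_repl, without_repl and frac_unique are made up for this example, and the seed is arbitrary.

```julia
using Random

rng = MersenneTwister(42)  # arbitrary seed, only for reproducibility
n = 1000                   # illustrative number of training instances

# With replacement (bootstrap): each draw is independent, so duplicates are expected.
with_repl = rand(rng, 1:n, n)

# Without replacement: a random permutation, so every index appears exactly once.
without_repl = randperm(rng, n)

# Fraction of distinct indices in a sample.
frac_unique(idx) = length(unique(idx)) / length(idx)

# In expectation this is 1 - (1 - 1/n)^n, which is roughly 0.63 for large n.
println("with replacement:    ", frac_unique(with_repl))

# Always 1.0, because a permutation has no repeats.
println("without replacement: ", frac_unique(without_repl))
```

So with bootstrap sampling, roughly a third of the instances are expected to be missing from any single tree's sample, even when the sample size equals N.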

I think it would be preferable if sampling were performed without replacement, so that partial_frac = 1.0 uses every instance exactly once. I don't know whether this is the standard convention for random forests, though.

I would be happy to contribute a PR if it's agreed that non-repeated sampling is preferred.

Thank you!

mharradon Oct 06 '22 22:10

After some reading, I see that sampling with replacement is the standard choice because it has theoretical justification, though I think in practice one might prefer either. Other libraries expose the choice of sampling with or without replacement as an argument; that would be a nice option here.
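
A rough sketch of what such an option could look like is below. The function sample_indices and its replace keyword are hypothetical names for illustration only; they are not part of DecisionTree.jl, and the real change would need to fit the package's existing forest-building code.

```julia
using Random

# Hypothetical helper (not DecisionTree.jl API): choose which rows a tree is fit on.
# `frac` plays the role of the sampling fraction; `replace` selects the scheme.
function sample_indices(rng::AbstractRNG, n::Integer, frac::Real; replace::Bool = true)
    k = round(Int, frac * n)
    if replace
        # Bootstrap: independent draws, duplicates allowed.
        return rand(rng, 1:n, k)
    else
        # Without replacement we cannot draw more rows than exist.
        k <= n || throw(ArgumentError("cannot draw $k of $n indices without replacement"))
        # Take the first k entries of a random permutation: no repeats.
        return randperm(rng, n)[1:k]
    end
end

rng = MersenneTwister(1)  # arbitrary seed
inds = sample_indices(rng, 10, 1.0; replace = false)
# With frac = 1.0 and replace = false, every row index appears exactly once.
println(sort(inds) == collect(1:10))
```

Defaulting replace to true in this sketch would keep the current behaviour unchanged for existing users.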

mharradon Oct 07 '22 14:10