Hleb Levitski comments

Results: 106 comments of Hleb Levitski

RandomForestClassifier RAM memory increase on newer versions

Could be connected to the addition of the `max_samples` feature, which generates several int arrays of shape (N, 1), but I'm not sure.

sklearn.svm.SVR use more RAM memory on newer versions

Could be connected to #8361 and to calculating `var` on the entire dataset when `gamma='scale'`. If true, I don't think there is anything that could be done about it.
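For context, a minimal sketch of the value that the scikit-learn documentation gives for `gamma='scale'`, namely `1 / (n_features * X.var())`. The helper name `scale_gamma` and the toy matrix are assumptions made for illustration; this is not the library's internal code, and whether this computation explains the memory increase is not confirmed here.

```python
import numpy as np


def scale_gamma(X):
    """Documented value of gamma='scale': 1 / (n_features * X.var())."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # X.var() is the variance over every element of X, so the whole
    # training matrix has to be read once to obtain it.
    return 1.0 / (n_features * X.var())


# Toy data (hypothetical): 2 samples, 2 features
X = np.array([[0.0, 1.0], [2.0, 3.0]])
gamma = scale_gamma(X)
print(gamma)
```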

DecisionTreeDiscretiser to output integers in addition to the predictions

If the DT has its random seed locked, it will always return the same values, so we can use something like `stats.rankdata(x, method='dense')` to transform the floats to integers.
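A minimal sketch of that idea, assuming the tree's predictions are already available as a float array. The `preds` values below are made up for illustration and do not come from a fitted tree.

```python
import numpy as np
from scipy import stats

# Hypothetical leaf predictions returned by a fitted decision tree
preds = np.array([0.25, 0.80, 0.25, 0.55, 0.80])

# Dense ranking gives equal values the same rank and leaves no gaps,
# so every distinct prediction maps to one integer code starting at 1.
codes = stats.rankdata(preds, method="dense").astype(int)
print(codes)
```

One caveat with this approach: the codes depend on which distinct values are present in the array being ranked, so data that does not reach every leaf could receive shifted codes. Storing a value-to-code mapping at fit time may be safer.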

feat: Kmeans Encode Categorical Feature

From the looks of it, it is the same as `sklearn.preprocessing.KBinsDiscretizer` with `strategy='kmeans'`
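For reference, a small sketch of what that transformer does on a single numeric column. The data and the choice of `n_bins=3` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy column (hypothetical) with three visible groups of values
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.1], [9.8], [10.0], [10.3]])

# strategy="kmeans" places bin edges halfway between 1D k-means centers;
# encode="ordinal" returns the bin index instead of a one-hot matrix.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
bins = disc.fit_transform(X).ravel().astype(int)
print(bins)
```

Note that this bins each feature independently, which is different from clustering several features jointly.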

feat: Kmeans Encode Categorical Feature

@solegalli sorry, my bad, I misunderstood OP. It is a transformation of k features into 1, so it is just a regular `fit_predict` of KMeans on a subset of the data. Which is something like...
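The original snippet is cut off above. As a separate illustration (not the author's code), a minimal sketch of clustering a subset of columns into one label column might look like this; the data, the column indices, and `n_clusters=2` are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy matrix (hypothetical): columns 0 and 1 are clustered, column 2 is ignored
X = np.array([
    [0.0, 0.0, 1.0],
    [0.1, 0.2, 2.0],
    [0.2, 0.1, 3.0],
    [5.0, 5.0, 4.0],
    [5.1, 5.2, 5.0],
    [5.2, 5.1, 6.0],
])
subset = [0, 1]

# k features in, one cluster label out per row
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X[:, subset])
print(labels)
```

The label numbering itself is arbitrary; only the grouping of rows is meaningful.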

[FEATURE] Categorical Variable Concatenation

Yes, it's a rerun of #84. If someone has an implementation of this, it should first be tested for OOM issues as well as speed, since there could be...

MAD-Median rule for outlier removal

We can translate multipliers to percentiles. For example, the 2- and 3-fold multipliers for the gaussian method are the 2-sigma and 3-sigma rules, which cover about 95.45% and 99.73% respectively ([wiki article if someone...
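The translation from a multiplier to a coverage figure can be sketched with the normal distribution's error function. The function names are hypothetical, and the sketch assumes normally distributed data, which is the case the 2-sigma and 3-sigma rules describe.

```python
import math


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def tail_percentiles(k):
    """Lower and upper percentile cut-offs equivalent to a +/- k sigma rule."""
    coverage = normal_coverage(k)
    lower = (1.0 - coverage) / 2.0 * 100.0
    upper = 100.0 - lower
    return lower, upper


for k in (2, 3):
    low, high = tail_percentiles(k)
    print(f"k={k}: coverage={normal_coverage(k):.4f}, percentiles=({low:.3f}, {high:.3f})")
```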

MAD-Median rule for outlier removal

Ok, I will make draft PR soon

Adding auto threshold to DropHighPSIFeatures

@gverbock yes, the threshold will depend on the size of the dataset (and on `split_frac`) and the number of bins, which makes sense since a larger number of bins would be more sensitive to changes...

Adding auto threshold to DropHighPSIFeatures

@solegalli ye, why not