Samuel Wilson
Right now, random state management is scattered all over the place and is inelegant, especially inside the default mean matching functions. See if most random processes can be switched over to use...
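One option (an assumption on my part, not what the note above specifies) would be a single numpy Generator created once and threaded through the mean matching step. A minimal sketch with hypothetical names, not the current API:

```python
import numpy as np

def default_mean_match(bachelor_preds, candidate_preds, candidate_values,
                       mean_match_candidates, rng):
    """Hypothetical mean matching that draws all randomness from one
    shared numpy Generator instead of managing its own state."""
    # order candidates by distance to each bachelor prediction
    order = np.abs(candidate_preds[None, :] - bachelor_preds[:, None]).argsort(axis=1)
    nearest = order[:, :mean_match_candidates]
    # randomly pick one of the k nearest candidates, using the shared rng
    choice = rng.integers(0, mean_match_candidates, size=len(bachelor_preds))
    return candidate_values[nearest[np.arange(len(bachelor_preds)), choice]]

rng = np.random.default_rng(42)        # created once, passed everywhere
bachelors = rng.normal(size=10)
candidates = rng.normal(size=100)
values = rng.normal(size=100)
imputed = default_mean_match(bachelors, candidates, values, 5, rng)
```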
lightgbm can handle the following; no reason we can't add support for them:
- H2O DataTable's Frame
- scipy.sparse

Look into lightgbm.Sequence... might be no point.
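For the scipy.sparse piece, lightgbm already accepts CSR matrices directly, so something like the sketch below (toy data, not our wrapper code) is roughly what it would look like:

```python
import numpy as np
import scipy.sparse as sp
import lightgbm as lgb

# Toy sparse feature matrix and binary target
X = sp.random(1000, 20, density=0.1, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=1000)

# lightgbm consumes the CSR matrix directly, no densification needed
dtrain = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "binary", "verbosity": -1}, dtrain,
                    num_boost_round=10)
preds = booster.predict(X)  # predict also accepts scipy.sparse input
```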
Actually storing the latest imputation values can take up a lot of memory. We have all the information we need to generate imputation values when complete_data() is called, so why not...
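A rough sketch of the idea, with hypothetical names (not the current internals): keep only the fitted model, candidate values, and a seed per variable, and rebuild the imputed values inside complete_data() instead of storing them. The mean matching here is deliberately simplified to nearest-candidate lookup:

```python
import numpy as np

class LazyImputationStore:
    """Hypothetical: store a recipe (model, candidates, seed) per variable
    and regenerate imputation values only when complete_data() is called."""

    def __init__(self):
        self._recipes = {}

    def record(self, variable, model, candidate_values, seed):
        self._recipes[variable] = (model, candidate_values, seed)

    def complete_data(self, variable, bachelor_features):
        model, candidate_values, seed = self._recipes[variable]
        rng = np.random.default_rng(seed)           # reproducible regeneration
        preds = model.predict(bachelor_features)
        # simplified matching: take the single nearest stored candidate value
        idx = np.abs(candidate_values[None, :] - preds[:, None]).argmin(axis=1)
        return candidate_values[idx]
```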
Getting this reliably. Chased it down to scipy.spatial.KDTree. Changing leafsize doesn't help. candidate_preds is float64, shape (294695, 1).
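For reference, the setup that seems to trigger it looks roughly like this (the data here is random and only matches the dtype and shape above; the exact query is an assumption):

```python
import numpy as np
from scipy.spatial import KDTree

# float64 column vector with the same shape as candidate_preds
candidate_preds = np.random.default_rng(0).normal(size=(294695, 1))

tree = KDTree(candidate_preds, leafsize=16)   # changing leafsize doesn't help
dist, idx = tree.query(candidate_preds[:1000], k=5)
```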
For correlations, this could be the % of matching imputed categories. For distributions, it could be a bar plot or boxplot (depending on the dataset?) of the histogram values.
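A rough sketch of the categorical "correlations" version (toy data, hypothetical layout): for each pair of imputed datasets, compute the % of rows where the imputed categories agree, then boxplot those pairwise percentages.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
m = 5                                             # number of imputed datasets
imputed = [rng.integers(0, 3, size=500) for _ in range(m)]

# % of matching imputed categories for every pair of datasets
pct_match = [
    100 * np.mean(imputed[i] == imputed[j])
    for i in range(m) for j in range(i + 1, m)
]

plt.boxplot(pct_match)
plt.ylabel("% matching imputed categories")
plt.title("Between-dataset agreement for a categorical variable")
plt.show()
```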
Have had good experience with Bayesian optimization in the past. Lightweight implementation: https://github.com/fmfn/BayesianOptimization
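As a sketch of how that package could drive lightgbm parameter tuning (the toy objective, parameters, and bounds below are just placeholders, not a proposed default):

```python
import numpy as np
import lightgbm as lgb
from bayes_opt import BayesianOptimization

# Toy regression data split into train/validation
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(size=500)
X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]
dtrain = lgb.Dataset(X_train, label=y_train)

def objective(num_leaves, min_data_in_leaf):
    # bayes_opt maximizes, so return the negative validation MSE
    params = {
        "objective": "regression",
        "num_leaves": int(num_leaves),
        "min_data_in_leaf": int(min_data_in_leaf),
        "verbosity": -1,
    }
    booster = lgb.train(params, dtrain, num_boost_round=50)
    preds = booster.predict(X_valid)
    return -float(np.mean((preds - y_valid) ** 2))

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"num_leaves": (8, 64), "min_data_in_leaf": (5, 50)},
    random_state=1,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)   # best parameters found and their score
```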
Right now it is set to the size of the data. If the data is even remotely large (10000 rows), this will cause Python to run out of memory.
Sensible feature groups save a huge amount of time by improving slicing and reducing the number of coalition combinations.
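A tiny illustration of the coalition point, assuming Shapley-style coalitions formed over groups rather than individual features:

```python
features = [f"f{i}" for i in range(12)]
groups = [features[0:4], features[4:8], features[8:12]]

# Coalitions grow as 2**k, so grouping shrinks the space dramatically
print(2 ** len(features))  # 4096 coalitions over individual features
print(2 ** len(groups))    # 8 coalitions over feature groups
```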
mice and mitml allow analysis pooling; look into adding this functionality.
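For reference, the pooling mice/mitml do is Rubin's rules; a minimal standalone version for a scalar estimate looks like this (toy numbers):

```python
import numpy as np

def pool_rubins_rules(estimates, variances):
    """Pool a scalar parameter estimated on m imputed datasets via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()             # pooled point estimate
    u_bar = variances.mean()             # within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    t = u_bar + (1 + 1 / m) * b          # total variance
    return q_bar, t

# e.g. a regression coefficient and its squared SE from 5 imputed datasets
coef, total_var = pool_rubins_rules(
    estimates=[0.52, 0.48, 0.55, 0.50, 0.47],
    variances=[0.010, 0.012, 0.011, 0.009, 0.013],
)
print(coef, total_var ** 0.5)            # pooled coefficient and its SE
```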