Samuel Wilson
Right now, random state management is scattered all over the place and is inelegant, especially inside the default mean matching functions. See if most random processes can be switched over to use...
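One option (an assumption on my part, not what the note above specifies) would be a single numpy Generator created once and threaded through the mean matching step. A minimal sketch with hypothetical names, not the current API:

```python
import numpy as np

def default_mean_match(bachelor_preds, candidate_preds, candidate_values,
                       mean_match_candidates, rng):
    """Hypothetical mean matching that draws all randomness from one
    shared numpy Generator instead of managing its own state."""
    # order candidates by distance to each bachelor prediction
    order = np.abs(candidate_preds[None, :] - bachelor_preds[:, None]).argsort(axis=1)
    nearest = order[:, :mean_match_candidates]
    # randomly pick one of the k nearest candidates, using the shared rng
    choice = rng.integers(0, mean_match_candidates, size=len(bachelor_preds))
    return candidate_values[nearest[np.arange(len(bachelor_preds)), choice]]

rng = np.random.default_rng(42)        # created once, passed everywhere
bachelors = rng.normal(size=10)
candidates = rng.normal(size=100)
values = rng.normal(size=100)
imputed = default_mean_match(bachelors, candidates, values, 5, rng)
```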
lightgbm can handle the following; no reason we can't add support for them:
- H2O DataTable's Frame
- scipy.sparse

Look into lightgbm.Sequence... might be no point.
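For the scipy.sparse piece, lightgbm already accepts CSR matrices directly, so something like the sketch below (toy data, not our wrapper code) is roughly what it would look like:

```python
import numpy as np
import scipy.sparse as sp
import lightgbm as lgb

# Toy sparse feature matrix and binary target
X = sp.random(1000, 20, density=0.1, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=1000)

# lightgbm consumes the CSR matrix directly, no densification needed
dtrain = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "binary", "verbosity": -1}, dtrain,
                    num_boost_round=10)
preds = booster.predict(X)  # predict also accepts scipy.sparse input
```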
Actually storing the latest imputation values can take up a lot of memory. We have all the information we need to generate imputation values when complete_data() is called, so why not...
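A rough sketch of the idea, with hypothetical names (not the current internals): keep only the fitted model, candidate values, and a seed per variable, and rebuild the imputed values inside complete_data() instead of storing them. The mean matching here is deliberately simplified to nearest-candidate lookup:

```python
import numpy as np

class LazyImputationStore:
    """Hypothetical: store a recipe (model, candidates, seed) per variable
    and regenerate imputation values only when complete_data() is called."""

    def __init__(self):
        self._recipes = {}

    def record(self, variable, model, candidate_values, seed):
        self._recipes[variable] = (model, candidate_values, seed)

    def complete_data(self, variable, bachelor_features):
        model, candidate_values, seed = self._recipes[variable]
        rng = np.random.default_rng(seed)           # reproducible regeneration
        preds = model.predict(bachelor_features)
        # simplified matching: take the single nearest stored candidate value
        idx = np.abs(candidate_values[None, :] - preds[:, None]).argmin(axis=1)
        return candidate_values[idx]
```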
Getting this reliably. Chased it down to scipy.spatial.KDTree. Changing leafsize doesn't help. candidate_preds is float64, shape (294695, 1).
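For reference, the setup that seems to trigger it looks roughly like this (the data here is random and only matches the dtype and shape above; the exact query is an assumption):

```python
import numpy as np
from scipy.spatial import KDTree

# float64 column vector with the same shape as candidate_preds
candidate_preds = np.random.default_rng(0).normal(size=(294695, 1))

tree = KDTree(candidate_preds, leafsize=16)   # changing leafsize doesn't help
dist, idx = tree.query(candidate_preds[:1000], k=5)
```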
For correlations, this could be the % of matching imputed categories. For distributions, it could be a bar plot or boxplot (depending on the dataset?) of the histogram values.
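A rough sketch of the categorical "correlations" version (toy data, hypothetical layout): for each pair of imputed datasets, compute the % of rows where the imputed categories agree, then boxplot those pairwise percentages.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
m = 5                                             # number of imputed datasets
imputed = [rng.integers(0, 3, size=500) for _ in range(m)]

# % of matching imputed categories for every pair of datasets
pct_match = [
    100 * np.mean(imputed[i] == imputed[j])
    for i in range(m) for j in range(i + 1, m)
]

plt.boxplot(pct_match)
plt.ylabel("% matching imputed categories")
plt.title("Between-dataset agreement for a categorical variable")
plt.show()
```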
Have had good experience with Bayesian optimization in the past. Lightweight implementation: https://github.com/fmfn/BayesianOptimization
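As a sketch of how that package could drive lightgbm parameter tuning (the toy objective, parameters, and bounds below are just placeholders, not a proposed default):

```python
import numpy as np
import lightgbm as lgb
from bayes_opt import BayesianOptimization

# Toy regression data split into train/validation
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(size=500)
X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]
dtrain = lgb.Dataset(X_train, label=y_train)

def objective(num_leaves, min_data_in_leaf):
    # bayes_opt maximizes, so return the negative validation MSE
    params = {
        "objective": "regression",
        "num_leaves": int(num_leaves),
        "min_data_in_leaf": int(min_data_in_leaf),
        "verbosity": -1,
    }
    booster = lgb.train(params, dtrain, num_boost_round=50)
    preds = booster.predict(X_valid)
    return -float(np.mean((preds - y_valid) ** 2))

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"num_leaves": (8, 64), "min_data_in_leaf": (5, 50)},
    random_state=1,
)
optimizer.maximize(init_points=3, n_iter=10)
print(optimizer.max)   # best parameters found and their score
```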
Right now it is set to the size of the data. If the data is even remotely large (10000 rows), this will cause Python to run out of memory.
Sensible feature groups save a huge amount of time by improving slicing and reducing the number of coalition combinations.
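A tiny illustration of the coalition point, assuming Shapley-style coalitions formed over groups rather than individual features:

```python
features = [f"f{i}" for i in range(12)]
groups = [features[0:4], features[4:8], features[8:12]]

# Coalitions grow as 2**k, so grouping shrinks the space dramatically
print(2 ** len(features))  # 4096 coalitions over individual features
print(2 ** len(groups))    # 8 coalitions over feature groups
```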
mice and mitml allow analysis pooling; look into adding this functionality.
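For reference, the pooling mice/mitml do is Rubin's rules; a minimal standalone version for a scalar estimate looks like this (toy numbers):

```python
import numpy as np

def pool_rubins_rules(estimates, variances):
    """Pool a scalar parameter estimated on m imputed datasets via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()             # pooled point estimate
    u_bar = variances.mean()             # within-imputation variance
    b = estimates.var(ddof=1)            # between-imputation variance
    t = u_bar + (1 + 1 / m) * b          # total variance
    return q_bar, t

# e.g. a regression coefficient and its squared SE from 5 imputed datasets
coef, total_var = pool_rubins_rules(
    estimates=[0.52, 0.48, 0.55, 0.50, 0.47],
    variances=[0.010, 0.012, 0.011, 0.009, 0.013],
)
print(coef, total_var ** 0.5)            # pooled coefficient and its SE
```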