
AutoML() doesn't seem to use Ray's object store (for large datasets)

Open ottobricks opened this issue 4 years ago • 8 comments

Hi there,

I have a large dataset (100+ GB) that I have been trying to use with FLAML (AutoML), with no success so far. Since FLAML uses Ray, shouldn't it take advantage of Ray's object store (which can spill objects to disk)? If not, do you have any suggestions on how to do out-of-core computation with FLAML?

ottobricks avatar Dec 27 '21 15:12 ottobricks

When I try to pass a Ray ObjectRef to AutoML's fit(), I get an error saying that a NumPy array, pandas DataFrame, or SciPy sparse matrix is expected.

ottobricks avatar Dec 27 '21 15:12 ottobricks

@ottok92 How do you perform training currently? If you have a working training function already, you can use flaml.tune to perform hyperparameter tuning.

sonichi avatar Dec 27 '21 17:12 sonichi

Thank you for the quick reply, @sonichi. Currently I do everything in Spark with Scala. I'm interested in using FLAML both because of the impressive CFO algorithm and also to make it easier for my colleagues to collaborate (everybody knows Python). I'll go through the doc you suggested to see if this will enable us to run FLAML on our large datasets. Thanks for the support!

ottobricks avatar Dec 28 '21 06:12 ottobricks

Integrating with Ray's object store looks very promising. Thank you for the suggestion. I will run some experiments and post feedback in this thread for future reference.

ottobricks avatar Dec 28 '21 08:12 ottobricks

@ottok92 That's great. I'm very interested in how it works for your use case. Another question: which learner do you use, for example lightgbm? FLAML has a built-in search space for each built-in learner, which might be useful. Here is an example of tuning lgbm: https://github.com/microsoft/FLAML/blob/main/test/tune_example.py To make it work for your dataset, modify the train_lgbm function, metric, mode, and time_budget_s, and set use_ray=True if you would like to do parallel tuning.

sonichi avatar Dec 28 '21 16:12 sonichi

Perfect! I'm working with XGBoost, which is also built-in. Once I finish playing with this, I will share my train_xgboost function. Maybe we can create a section in the Docs for "handling large datasets with Ray".

ottobricks avatar Dec 29 '21 06:12 ottobricks

> Perfect! I'm working with XGBoost, which is also built-in. Once I finish playing with this, I will share my train_xgboost function. Maybe we can create a section in the Docs for "handling large datasets with Ray".

That'll be super cool. Looking forward to it.

sonichi avatar Dec 29 '21 16:12 sonichi

BTW, flaml provides two search spaces for XGBoost. XGBoostSklearnEstimator tunes "max_leaves", and XGBoostLimitDepthEstimator tunes "max_depth".

sonichi avatar Feb 04 '22 19:02 sonichi