Large datasets: support for Dask arrays?

Open · Hoeze opened this issue 4 years ago · 5 comments

Hi, I tried training an ExplainableBoostingRegressor using Dask arrays, but I keep running into the following issue:

ERROR:interpret.utils.all:Could not unify data of type: <class 'dask.array.core.Array'>

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-b5e1d33190e1> in <module>
      1 model = create_model()
      2 
----> 3 fold_models, train_preds, valid_preds = model_cv(model)

<ipython-input-48-178d0374bff6> in model_cv(model)
     17 
---> 18             fold_model = sklearn.clone(model).fit(x_train_fold, y_train_fold)
     19 

/opt/anaconda/envs/ebm/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y)
    744         # TODO: PK don't overwrite self.feature_names here (scikit-learn rules), and it's also confusing to
    745         #       user to have their fields overwritten.  Use feature_names_out_ or something similar
--> 746         X, y, self.feature_names, _ = unify_data(
    747             X, y, self.feature_names, self.feature_types, missing_data_allowed=False
    748         )

/opt/anaconda/envs/ebm/lib/python3.8/site-packages/interpret/utils/all.py in unify_data(data, labels, feature_names, feature_types, missing_data_allowed)
    325         msg = "Could not unify data of type: {0}".format(type(data))
    326         log.error(msg)
--> 327         raise ValueError(msg)
    328 
    329     new_labels = unify_vector(labels)

ValueError: Could not unify data of type: <class 'dask.array.core.Array'>

Each of my folds is a 2D array with 56 features, occupying ~16 GB of memory. Passing model.fit(X.compute(), y.compute()) runs out of memory after some time, probably because Joblib copies the data around unnecessarily.
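For anyone trying to reproduce this, here is a minimal sketch that triggers the same error; the shapes and chunk sizes are made up, not the actual data above:

```python
# Hypothetical minimal reproduction; array shapes and chunks are illustrative.
import dask.array as da
from interpret.glassbox import ExplainableBoostingRegressor

X = da.random.random((1_000_000, 56), chunks=(100_000, 56))
y = da.random.random((1_000_000,), chunks=(100_000,))

ebm = ExplainableBoostingRegressor()
ebm.fit(X, y)  # ValueError: Could not unify data of type: <class 'dask.array.core.Array'>
```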

Hoeze · Jun 08 '21 17:06

Hi @Hoeze,

Thanks for bringing this up! We haven't looked into explicitly supporting Dask before, but it seems worth investigating. It's not too difficult to fix this specific error in our utility function -- we'd just need to add a type check for Dask arrays there -- but other issues elsewhere in the algorithm might be unmasked once we do. At first glance, it does seem promising -- Dask's hook-in for Joblib's scheduler might make this easier to support than other large-scale parallelization frameworks.
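To illustrate the hook-in being referred to, here is a minimal sketch, assuming a dask.distributed cluster is reachable and that X_train/y_train are placeholder Dask arrays. It only routes joblib-level parallelism through Dask's scheduler; the arrays themselves still have to be materialized (e.g. with .compute()) before being passed to fit(), since EBM does not accept Dask arrays today:

```python
# Sketch: redirect joblib-parallelized work through Dask's joblib backend.
import joblib
from dask.distributed import Client
from interpret.glassbox import ExplainableBoostingRegressor

client = Client()  # starts a local cluster by default, or pass a scheduler address

ebm = ExplainableBoostingRegressor()
with joblib.parallel_backend("dask"):
    # X_train / y_train are placeholder Dask arrays; materialize before fit()
    ebm.fit(X_train.compute(), y_train.compute())
```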

Unfortunately there's no exact timeline on when we can investigate this deeply, but we'll use this issue to track any progress we make on it.

-InterpretML Team

interpret-ml · Jun 14 '21 13:06

+1 for some support for distributed processing. Possibly, using Ray for this would serve several purposes (a brief sketch follows the list):

  • It could unify local and distributed processing (local/multiprocess mode would not require a Ray cluster)
  • It would be compatible with other frameworks that run on top of Ray, including Dask
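A minimal sketch of the unification the first bullet describes, assuming only that Ray is installed: Ray ships a joblib backend that can stand in for the default multiprocessing backend locally, and can point at a cluster when one is available. The estimator and data names are placeholders:

```python
# Sketch: use Ray's joblib backend (ray.util.joblib) for joblib-parallelized fits.
import joblib
import ray
from ray.util.joblib import register_ray

ray.init()       # local by default; pass address="auto" to join an existing cluster
register_ray()   # registers "ray" as a joblib backend

with joblib.parallel_backend("ray"):
    model.fit(X_train, y_train)  # placeholder estimator and data
```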

MainRo · Jun 22 '21 10:06

> +1 for some support for distributed processing. Possibly, using Ray for this would serve several purposes:

For the record, I'm also relying on dask-on-ray. However, Dask is the common standard; Ray just provides a distributed scheduler for Dask :)
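For context, a minimal sketch of the dask-on-ray setup mentioned here, assuming both ray and dask are installed; the array sizes are illustrative:

```python
# Sketch: execute Dask collections on Ray's scheduler (dask-on-ray).
import ray
import dask.array as da
from ray.util.dask import enable_dask_on_ray

ray.init()
enable_dask_on_ray()  # route Dask task graphs to Ray for execution

x = da.random.random((1_000_000, 56), chunks=(100_000, 56))
print(x.mean().compute())  # computed by Ray workers
```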

Hoeze · Jun 22 '21 10:06

Don't forget about Spark (PySpark) :) There's a joblib backend for Spark too: https://github.com/joblib/joblib-spark

Also see: https://github.com/interpretml/interpret/issues/243
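A minimal sketch of wiring up the joblib-spark backend linked above, assuming an active SparkSession and an estimator whose fit is parallelized with joblib; the estimator and data names are placeholders:

```python
# Sketch: run joblib-parallelized work on Spark executors via joblib-spark.
import joblib
from joblibspark import register_spark

register_spark()  # registers "spark" as a joblib backend (uses the active SparkSession)

with joblib.parallel_backend("spark", n_jobs=4):
    model.fit(X_train, y_train)  # placeholder estimator and data
```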

candalfigomoro · Aug 04 '21 13:08

Another vote for dask here

onacrame · Feb 20 '22 07:02