skrub icon indicating copy to clipboard operation
skrub copied to clipboard

cuml implementation of SuperVectorizer, GapEncoder, SimilarityEncoder

Open dcolinmorgan opened this issue 3 years ago • 2 comments

engine flag to enable cuml-based implementation of class functions

Benefits to the change: gpu-based speedup

Naive pseudocode for the new behavior (realistically much tougher to implement):

if self.engine == 'cuml':
  from cuml.cluster import KMeans
  from cuml.feature_extraction.text import CountVectorizer, HashingVectorizer
  from cuml.neighbors import NearestNeighbors
  from cuml.preprocessing import OneHotEncoder

dcolinmorgan avatar Sep 20 '22 08:09 dcolinmorgan

It's not absolutely obvious that a naive implementation would lead to speed-ups.

Anyhow, an important question for this alley to be possible: how would we do CI (continuous integration)?

GaelVaroquaux avatar Sep 20 '22 10:09 GaelVaroquaux

Maybe a useful, more declarative question is: What might be a good path to enabling plugging in custom handlers, and limiting selection to them?

In our case, sometimes we want CPU-only, sometimes GPU-only (ex: an end-to-end GPU pipeline), and sometimes, ambivalent (e.g., care more about quality). There are other variants of this, like local vs remote, dask_cpu vs dask_cudf, competing impls of same alg, .... .

Our use is pretty limited:

E.g.,

https://github.com/graphistry/pygraphistry/blob/97b52e22b4d9c17b40a4a33fed115dd789c981e7/graphistry/feature_utils.py#L892

https://github.com/graphistry/pygraphistry/blob/97b52e22b4d9c17b40a4a33fed115dd789c981e7/graphistry/feature_utils.py#L944

lmeyerov avatar Sep 27 '22 02:09 lmeyerov