partd icon indicating copy to clipboard operation
partd copied to clipboard

Support for pandas DataFrame subclasses

Open jorisvandenbossche opened this issue 4 years ago • 2 comments

When dask uses partd for eg shuffle operations, the dataframes always come back as a pandas.DataFrame, even if a subclass was stored (xref https://github.com/geopandas/dask-geopandas/issues/59#issuecomment-864469674).

For example:

import geopandas
gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

import partd
# dask.dataframe shuffle operations use PandasBlocks
p = partd.PandasBlocks(partd.Dict())

p.append({"gdf": gdf})
res = p.get("gdf")

>>> type(gdf)
pandas.core.frame.DataFrame
>>> type(res)
pandas.core.frame.DataFrame

To be able to use dask's shuffle operations with dask_geopandas, which uses a pandas subclass as the partition type, the subclass should be preserved in the partd roundtrip (or are there other ways that you can override / dispatch this operation in dask?). I was wondering how other dask.dataframe subclasses handle this, but eg dask_cudf doesn't seem to support "disk"-based shuffling.

jorisvandenbossche avatar Jun 21 '21 08:06 jorisvandenbossche

The pandas.DataFrame is recreated here:

https://github.com/dask/partd/blob/9c9ba0a3a91b6b1eeb560615114a1df81fc427c1/partd/pandas.py#L194-L203

Currently only the actual underlying data is stored, and so when deserializing, I don't think partd knows anything about the original class? So if we want to change that, it would need to start store some additional information?

An alternative might be to tackle this on the dask side, and ensure the retrieved part is of the same type as meta (eg could do something like res = meta._constructor(res) at https://github.com/dask/dask/blob/8aea537d925b794a94f828d35211a5da05ad9dce/dask/dataframe/shuffle.py#L740-L744)

jorisvandenbossche avatar Jun 21 '21 08:06 jorisvandenbossche

cc @madsbk @quasiben for their GPU-shuffle connection

jrbourbeau avatar Jun 23 '21 23:06 jrbourbeau