[BUG] Groupby does not correctly aggregate vector features

Open fdtomasi opened this issue 3 years ago • 0 comments

Describe the bug

Vector (list-valued) features are aggregated incorrectly by the Groupby operator.

Steps/Code to reproduce bug

import pandas as pd
import nvtabular as nvt
from nvtabular import ops as nvt_ops

df = pd.DataFrame(dict(
    user_id=[1, 1, 2, 3],
    user_vector=[[1, 2, 3], [1, 2, 3], [2, 2, 3], [3, 2, 3]],
    item_id=["a", "b", "a", "b"],
    item_vector=[[10, 20, 30], [20, 40, 60], [10, 20, 30], [20, 40, 60]],
))

df
output = df.columns >> nvt_ops.Groupby(
    groupby_cols="user_id",
    aggs=dict(user_vector="first", item_id="list", item_vector=list),
)
workflow = nvt.Workflow(output)
ds = nvt.Dataset(df)
out = workflow.fit_transform(ds).compute()

out

The current behaviour is out["user_vector_first"] == [1, 3, 1], which does not make sense: each group should yield the full vector of its first row, not a single scalar.

Expected behavior

out["user_vector_first"] == [[1,2,3], [2,2,3], [3,2,3]]

The correct output is given by pandas:

df.groupby("user_id").agg(
    user_vector=("user_vector", lambda x: x.iloc[0]),
    item_id=("item_id", list),
    item_vector=("item_vector", list),
).reset_index()
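For checking results without a GPU or any third-party package, the expected semantics can also be sketched in plain Python. This is only an illustrative reference, not NVTabular's implementation: `reference_groupby` is a hypothetical helper, and the output names `item_id_list` and `item_vector_list` are an assumption that follows the `user_vector_first` naming seen above.

```python
# Same rows as the reproduction above, as plain Python records.
rows = [
    {"user_id": 1, "user_vector": [1, 2, 3], "item_id": "a", "item_vector": [10, 20, 30]},
    {"user_id": 1, "user_vector": [1, 2, 3], "item_id": "b", "item_vector": [20, 40, 60]},
    {"user_id": 2, "user_vector": [2, 2, 3], "item_id": "a", "item_vector": [10, 20, 30]},
    {"user_id": 3, "user_vector": [3, 2, 3], "item_id": "b", "item_vector": [20, 40, 60]},
]


def reference_groupby(rows, key):
    """Hypothetical helper: 'first' on user_vector, 'list' on item_id and item_vector."""
    groups = {}  # dicts keep insertion order, so groups appear in first-seen order
    for row in rows:
        # The default record is only stored the first time a key is seen,
        # so user_vector_first keeps the whole vector of the group's first row.
        group = groups.setdefault(row[key], {
            key: row[key],
            "user_vector_first": row["user_vector"],
            "item_id_list": [],
            "item_vector_list": [],
        })
        group["item_id_list"].append(row["item_id"])
        group["item_vector_list"].append(row["item_vector"])
    return list(groups.values())


expected = reference_groupby(rows, "user_id")
for record in expected:
    print(record)
```

Comparing `[r["user_vector_first"] for r in expected]` with the column returned by the workflow makes the mismatch explicit: the reference keeps one full vector per user.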

Environment details

Environment location: Bare-metal
Method of NVTabular install: conda
nvtabular==1.4.0, rapids==22.08, cudf==22.06

Additional context

This was not an issue in an environment created with rapids==22.04.
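Since the behaviour seems to depend on the installed versions, it may help to record them in both the working and the failing environment. A minimal sketch using only the standard library follows. The package list is an assumption based on the versions quoted in this report, and `installed_versions` is a hypothetical helper. Conda-only metapackages such as rapids may not be visible this way, in which case `conda list` is the more reliable source.

```python
from importlib import metadata

# Package names are assumptions taken from the versions listed in this report.
PACKAGES = ("nvtabular", "cudf", "pandas")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this in both environments and diffing the output should narrow down which package changed between the two.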

Sep 14 '22 12:09 fdtomasi