NVTabular
[BUG] Groupby does not aggregate vector features correctly
Describe the bug
Vector (list-valued) features are aggregated incorrectly by the Groupby operator.
Steps/Code to reproduce bug
import pandas as pd
import nvtabular as nvt
from nvtabular import ops as nvt_ops
df = pd.DataFrame(dict(
user_id=[1, 1, 2, 3], user_vector=[[1,2,3], [1,2,3], [2,2,3], [3,2,3]],
item_id=["a", "b", "a", "b"], item_vector=[[10,20,30], [20,40,60], [10,20,30], [20,40,60]],
))
df
# Use the string "list" for every list aggregation, consistent with item_id.
output = df.columns >> nvt_ops.Groupby(
    groupby_cols="user_id",
    aggs=dict(user_vector="first", item_id="list", item_vector="list"),
)
workflow = nvt.Workflow(output)
ds = nvt.Dataset(df)
out = workflow.fit_transform(ds).compute()
out
Actual behavior
out["user_vector_first"] == [1, 3, 1]
This is a flat column of scalars rather than the first vector of each user.
Expected behavior
out["user_vector_first"] == [[1,2,3], [2,2,3], [3,2,3]]
The equivalent pandas aggregation gives the correct output:
df.groupby("user_id").agg(
user_vector=("user_vector", lambda x: x.iloc[0]),
item_id=("item_id", list),
item_vector=("item_vector", list),
).reset_index()
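For anyone who wants to check the expected result without a GPU, the pandas reference above can be run as a standalone script. This is only a sketch of the expected semantics, not NVTabular's implementation. The output column `user_vector_first` follows the name used in this report, while `item_id_list` and `item_vector_list` are assumed names chosen to match that pattern.

```python
import pandas as pd

# Same toy data as in the reproduction steps.
df = pd.DataFrame(
    dict(
        user_id=[1, 1, 2, 3],
        user_vector=[[1, 2, 3], [1, 2, 3], [2, 2, 3], [3, 2, 3]],
        item_id=["a", "b", "a", "b"],
        item_vector=[[10, 20, 30], [20, 40, 60], [10, 20, 30], [20, 40, 60]],
    )
)

# CPU reference: take the first vector per user and collect the item
# columns into lists. Output names other than user_vector_first are
# assumptions made for this sketch.
expected = (
    df.groupby("user_id")
    .agg(
        user_vector_first=("user_vector", lambda x: x.iloc[0]),
        item_id_list=("item_id", list),
        item_vector_list=("item_vector", list),
    )
    .reset_index()
)

print(expected)
```

Comparing `expected["user_vector_first"].tolist()` with the NVTabular output (converted to pandas) should make the mismatch easy to see.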
Environment details
- Environment location: Bare-metal
- Method of NVTabular install: conda
- nvtabular==1.4.0, rapids==22.08, cudf==22.06
Additional context: This was not an issue in an environment created with rapids==22.04.