[BUG] `Categorify` can't process `vocabs` correctly when `num_buckets>1`
**Describe the bug**

`nvt.ops.Categorify` doesn't process `vocabs` correctly when `num_buckets>1` is given at the same time.
**Steps/Code to reproduce bug**

I tried to use the Categorify transform with pre-defined vocabs. I also need to handle multiple OOV buckets, so I pass `num_buckets>1` as a parameter as well.
```python
from merlin.core import dispatch
import pandas as pd
import nvtabular as nvt

df = dispatch.make_df(
    {
        "Authors": [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], []],
        "Post": [1, 2, 3, 4],
    }
)
cat_names = ["Authors"]
label_name = ["Post"]
vocabs = {"Authors": pd.Series([f"User_{x}" for x in "ACBE"])}
cat_features = cat_names >> nvt.ops.Categorify(
    num_buckets=2, vocabs=vocabs, max_size={"Authors": 8},
)
workflow = nvt.Workflow(cat_features + label_name)
df_out = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
```
For the code above, the expected index layout is (see the sketch after this list):
- pad: [0]
- null: [1]
- oov: [2, 3]
- unique: [4, 5, 6, 7]
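
To make that layout concrete, here is a minimal sketch (plain Python, not the NVTabular API) of how I compute the expected offsets from `num_buckets` and the supplied vocabulary; the helper name `expected_layout` is mine, not part of the library:

```python
# Assumption: pad and null each take one slot, OOV takes `num_buckets` slots,
# and the unique vocab entries start right after the OOV range.
def expected_layout(vocab, num_buckets=1):
    pad = [0]
    null = [1]
    oov = list(range(2, 2 + num_buckets))
    unique_start = 2 + num_buckets
    unique = {value: unique_start + i for i, value in enumerate(vocab)}
    return pad, null, oov, unique

pad, null, oov, unique = expected_layout(
    ["User_A", "User_C", "User_B", "User_E"], num_buckets=2
)
# pad == [0], null == [1], oov == [2, 3]
# unique == {"User_A": 4, "User_C": 5, "User_B": 6, "User_E": 7}
```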
But I get the following result, with a wrong category dictionary.

`df_out`:

|   | Authors | Post |
|---|---|---|
| 0 | [7] | 1 |
| 1 | [ 7 10] | 2 |
| 2 | [9 8] | 3 |
| 3 | [] | 4 |

`pd.read_parquet("./categories/meta.Authors.parquet")`:

|   | kind | offset | num_indices |
|---|---|---|---|
| 0 | pad | 0 | 1 |
| 1 | null | 1 | 1 |
| 2 | oov | 2 | 1 |
| 3 | unique | 3 | 4 |

`pd.read_parquet("./categories/unique.Authors.parquet")`:

|   | Authors |
|---|---|
| 3 | User_A |
| 4 | User_C |
| 5 | User_B |
| 6 | User_E |
I checked inside the `Categorify.process_vocabs` function, and `oov_count` does pick up `num_buckets` correctly. But when `process_vocabs` calls `Categorify._save_encodings()`, `oov_count` is not passed along, so the vocabulary dictionary is built as if there were only a single OOV slot and the unique entries land at the wrong offsets.
**Expected behavior**

From https://github.com/NVIDIA-Merlin/NVTabular/blob/77b94a40babfea160130c70160dfdf60356b4f16/nvtabular/ops/categorify.py#L432-L438, I changed the code so that `process_vocabs` calls `Categorify._save_encodings` with `oov_count`:
```python
def process_vocabs(self, vocabs):
    ...
    oov_count = 1
    if num_buckets:
        oov_count = (
            num_buckets if isinstance(num_buckets, int) else num_buckets[col_name]
        ) or 1
    col_df = dispatch.make_df(vals).dropna()
    col_df.index += NULL_OFFSET + oov_count
    # before
    # save_path = _save_encodings(col_df, base_path, col_name)
    # after
    save_path = _save_encodings(col_df, base_path, col_name, oov_count=oov_count)
```
With this change, `df_out` comes out as I expected:
|   | Authors | Post |
|---|---|---|
| 0 | [4] | 1 |
| 1 | [ 4 7] | 2 |
| 2 | [6 5] | 3 |
| 3 | [] | 4 |
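
If it helps, a small sanity check along these lines (hypothetical, not part of the repro; run against `df_out` produced by the patched code, with `to_pandas()` only needed when the output is a cuDF frame) confirms the fixed output matches the expected layout:

```python
# Encoded Authors should follow: pad=0, null=1, oov=2..3, uniques from 4 in vocab order.
expected = {"User_A": 4, "User_C": 5, "User_B": 6, "User_E": 7}
authors = df_out["Authors"]
if hasattr(authors, "to_pandas"):  # cuDF Series -> pandas for easy list access
    authors = authors.to_pandas()
assert list(authors.iloc[0]) == [expected["User_A"]]
assert list(authors.iloc[1]) == [expected["User_A"], expected["User_E"]]
assert list(authors.iloc[2]) == [expected["User_B"], expected["User_C"]]
assert list(authors.iloc[3]) == []
```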

**Environment details (please complete the following information):**
- Environment location: Bare-metal (CentOS 7)
- Method of NVTabular install: pip

**Additional context**

None
---

In all of the applications I've built, OOV has been a single embedding used to represent the fact that an item is new or rare. Can you help me understand the use case? Why would you want multiple OOV values? They're so rare that they'll effectively end up as random embeddings; grouping them gives you some information.