[BUG] `Categorify` can't process `vocabs` correctly when `num_buckets>1`
**Describe the bug**

`nvt.ops.Categorify` doesn't process `vocabs` correctly when `num_buckets>1` is given at the same time.
**Steps/Code to reproduce bug**

I tried to use the Categorify transform with pre-defined vocabs. I also need to handle multiple OOV buckets, so I pass `num_buckets>1` as a parameter as well.
```python
from merlin.core import dispatch
import pandas as pd
import nvtabular as nvt

df = dispatch.make_df(
    {
        "Authors": [["User_A"], ["User_A", "User_E"], ["User_B", "User_C"], []],
        "Post": [1, 2, 3, 4],
    }
)
cat_names = ["Authors"]
label_name = ["Post"]
vocabs = {"Authors": pd.Series([f"User_{x}" for x in "ACBE"])}
cat_features = cat_names >> nvt.ops.Categorify(
    num_buckets=2, vocabs=vocabs, max_size={"Authors": 8},
)
workflow = nvt.Workflow(cat_features + label_name)
df_out = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
```
For the code above, the expected index layout is (see the sketch after this list):
- pad: [0]
- null: [1]
- oov: [2, 3]
- unique: [4, 5, 6, 7]
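
To make that layout concrete, here is a minimal sketch (plain Python, not the NVTabular API) of how I compute the expected offsets from `num_buckets` and the supplied vocabulary; the helper name `expected_layout` is mine, not part of the library:

```python
# Assumption: pad and null each take one slot, OOV takes `num_buckets` slots,
# and the unique vocab entries start right after the OOV range.
def expected_layout(vocab, num_buckets=1):
    pad = [0]
    null = [1]
    oov = list(range(2, 2 + num_buckets))
    unique_start = 2 + num_buckets
    unique = {value: unique_start + i for i, value in enumerate(vocab)}
    return pad, null, oov, unique

pad, null, oov, unique = expected_layout(
    ["User_A", "User_C", "User_B", "User_E"], num_buckets=2
)
# pad == [0], null == [1], oov == [2, 3]
# unique == {"User_A": 4, "User_C": 5, "User_B": 6, "User_E": 7}
```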
But I get the following result, with a wrong category dictionary.

`df_out`:

|   | Authors | Post |
|---|---|---|
| 0 | [7] | 1 |
| 1 | [ 7 10] | 2 |
| 2 | [9 8] | 3 |
| 3 | [] | 4 |

`pd.read_parquet("./categories/meta.Authors.parquet")`:

|   | kind | offset | num_indices |
|---|---|---|---|
| 0 | pad | 0 | 1 |
| 1 | null | 1 | 1 |
| 2 | oov | 2 | 1 |
| 3 | unique | 3 | 4 |

`pd.read_parquet("./categories/unique.Authors.parquet")`:

|   | Authors |
|---|---|
| 3 | User_A |
| 4 | User_C |
| 5 | User_B |
| 6 | User_E |
I checked inside the `Categorify.process_vocabs` function, and `oov_count` does pick up `num_buckets` correctly. But when `process_vocabs` calls `Categorify._save_encodings()`, `oov_count` is not passed along, so the vocabulary dictionary is built as if there were only a single OOV slot and the unique entries land at the wrong offsets.
**Expected behavior**

From https://github.com/NVIDIA-Merlin/NVTabular/blob/77b94a40babfea160130c70160dfdf60356b4f16/nvtabular/ops/categorify.py#L432-L438, I changed the code so that `process_vocabs` calls `Categorify._save_encodings` with `oov_count`:
```python
def process_vocabs(self, vocabs):
    ...
    oov_count = 1
    if num_buckets:
        oov_count = (
            num_buckets if isinstance(num_buckets, int) else num_buckets[col_name]
        ) or 1
    col_df = dispatch.make_df(vals).dropna()
    col_df.index += NULL_OFFSET + oov_count
    # before
    # save_path = _save_encodings(col_df, base_path, col_name)
    # after
    save_path = _save_encodings(col_df, base_path, col_name, oov_count=oov_count)
```
With this change, `df_out` comes out as I expected:
|   | Authors | Post |
|---|---|---|
| 0 | [4] | 1 |
| 1 | [ 4 7] | 2 |
| 2 | [6 5] | 3 |
| 3 | [] | 4 |
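
If it helps, a small sanity check along these lines (hypothetical, not part of the repro; run against `df_out` produced by the patched code, with `to_pandas()` only needed when the output is a cuDF frame) confirms the fixed output matches the expected layout:

```python
# Encoded Authors should follow: pad=0, null=1, oov=2..3, uniques from 4 in vocab order.
expected = {"User_A": 4, "User_C": 5, "User_B": 6, "User_E": 7}
authors = df_out["Authors"]
if hasattr(authors, "to_pandas"):  # cuDF Series -> pandas for easy list access
    authors = authors.to_pandas()
assert list(authors.iloc[0]) == [expected["User_A"]]
assert list(authors.iloc[1]) == [expected["User_A"], expected["User_E"]]
assert list(authors.iloc[2]) == [expected["User_B"], expected["User_C"]]
assert list(authors.iloc[3]) == []
```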

**Environment details (please complete the following information):**
- Environment location: Bare-metal (CentOS 7)
- Method of NVTabular install: pip

**Additional context**

None
---

In all of the applications I've built, OOV has been a single embedding used to represent the fact that an item is new or rare. Can you help me understand the use case? Why would you want multiple OOV values? They're so rare that they'll effectively end up as random embeddings; grouping them gives you some information.