Merlin Raw id notebook Conversion how to

This PR shows how to store the raw ids for both items and users in the feature store and use those ids where necessary within the pipeline.
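
As a rough illustration of the idea (not the notebook's actual code), the sketch below assigns contiguous integer codes to raw ids and keeps the original value in a companion column. The `_raw` suffix follows the `item_id_raw` name used later in this thread; the function name, the sample ids, and the plain dict standing in for the feature store are all hypothetical.

```python
def encode_with_raw_ids(raw_ids, column="item_id"):
    """Assign contiguous integer codes and keep each raw value alongside its code.

    Codes start at 1; reserving 0 for unknown values is an assumption of this sketch.
    """
    codes = {}
    rows = []
    for raw in raw_ids:
        # Reuse the existing code for a repeated raw id, otherwise assign the next one.
        code = codes.setdefault(raw, len(codes) + 1)
        rows.append({column: code, f"{column}_raw": raw})
    return rows, codes


# Hypothetical raw item ids as they might appear in interaction data.
rows, codes = encode_with_raw_ids(["sku-9", "sku-3", "sku-9"])

# A plain dict keyed by the encoded id stands in for the feature store here.
feature_table = {row["item_id"]: row for row in rows}

# Anywhere the pipeline holds an encoded id, the raw id is one lookup away.
print(feature_table[2]["item_id_raw"])
```

Keeping the raw value in the same row as the encoded id means later stages can return raw ids without maintaining a separate mapping file.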

Jul 18 '22 20:07 jperez999

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Jul 18 '22 20:07 review-notebook-app[bot]

Click to view CI Results

GitHub pull request #474 of commit 3bbcca6fd9a613df6b23c9934340d056e04c13d1, no merge conflicts.
Running as SYSTEM
Setting status of 3bbcca6fd9a613df6b23c9934340d056e04c13d1 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/267/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/474/*:refs/remotes/origin/pr/474/* # timeout=10
 > git rev-parse 3bbcca6fd9a613df6b23c9934340d056e04c13d1^{commit} # timeout=10
Checking out Revision 3bbcca6fd9a613df6b23c9934340d056e04c13d1 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3bbcca6fd9a613df6b23c9934340d056e04c13d1 # timeout=10
Commit message: "raw_id setup for e2e"
 > git rev-list --no-walk dd36e3afd92da6d92cdab47fa4fbce43161d1c4b # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins12142910180240625167.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items
tests/unit/test_version.py .                                             [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py F      [100%]
=================================== FAILURES ===================================
__________________________________ test_func ___________________________________
def test_func():
    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "01-Building-Recommender-Systems-with-Merlin.ipynb",
        execute=False,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["NUM_ROWS"] = "10000"
            os.system("mkdir -p /tmp/examples")
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        tb1.execute()
        assert os.path.isdir("/tmp/examples/dlrm")
        assert os.path.isdir("/tmp/examples/feature_repo")
        assert os.path.isdir("/tmp/examples/query_tower")
        assert os.path.isfile("/tmp/examples/item_embeddings.parquet")
        assert os.path.isfile("/tmp/examples/feature_repo/user_features.py")
        assert os.path.isfile("/tmp/examples/feature_repo/item_features.py")

    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb",
        execute=False,
    ) as tb2:
        tb2.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        NUM_OF_CELLS = len(tb2.cells)
        tb2.execute_cell(list(range(0, NUM_OF_CELLS - 3)))
        top_k = tb2.ref("top_k")
        outputs = tb2.ref("outputs")


>           assert outputs[0] == "ordered_ids"


E           AssertionError: assert 'item_category' == 'ordered_ids'
E             - ordered_ids
E             + item_category
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py:56: AssertionError
----------------------------- Captured stderr call -----------------------------
2022-07-18 20:43:56.594199: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-18 20:43:58.563387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory:  -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-18 20:43:58.564229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory:  -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python3.8/logging/__init__.py", line 2127, in shutdown
h.close()
File "/usr/local/lib/python3.8/dist-packages/absl/logging/__init__.py", line 934, in close
self.stream.close()
File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 438, in close
self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
WARNING clustering 436 points to 32 centroids: please provide at least 1248 training points
2022-07-18 20:45:23.893679: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-18 20:45:25.865196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory:  -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-18 20:45:25.865968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory:  -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
=========================== short test summary info ============================
FAILED tests/unit/examples/test_building_deploying_multi_stage_RecSys.py::test_func
=================== 1 failed, 1 passed in 106.17s (0:01:46) ====================
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins8488356159428892603.sh

Jul 18 '22 20:07 nvidia-merlin-bot

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-474

Jul 18 '22 20:07 github-actions[bot]

Click to view CI Results

GitHub pull request #474 of commit be6887569b8b662cfc299c284e11c0c1e1e23da6, no merge conflicts.
Running as SYSTEM
Setting status of be6887569b8b662cfc299c284e11c0c1e1e23da6 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/270/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/474/*:refs/remotes/origin/pr/474/* # timeout=10
 > git rev-parse be6887569b8b662cfc299c284e11c0c1e1e23da6^{commit} # timeout=10
Checking out Revision be6887569b8b662cfc299c284e11c0c1e1e23da6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f be6887569b8b662cfc299c284e11c0c1e1e23da6 # timeout=10
Commit message: "Merge branch 'main' into raw-id-nb"
 > git rev-list --no-walk dd36e3afd92da6d92cdab47fa4fbce43161d1c4b # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins6907072761729827191.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items
tests/unit/test_version.py .                                             [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py F      [100%]
=================================== FAILURES ===================================
__________________________________ test_func ___________________________________
def test_func():
    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "01-Building-Recommender-Systems-with-Merlin.ipynb",
        execute=False,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["NUM_ROWS"] = "10000"
            os.system("mkdir -p /tmp/examples")
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        tb1.execute()
        assert os.path.isdir("/tmp/examples/dlrm")
        assert os.path.isdir("/tmp/examples/feature_repo")
        assert os.path.isdir("/tmp/examples/query_tower")
        assert os.path.isfile("/tmp/examples/item_embeddings.parquet")
        assert os.path.isfile("/tmp/examples/feature_repo/user_features.py")
        assert os.path.isfile("/tmp/examples/feature_repo/item_features.py")

    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb",
        execute=False,
    ) as tb2:
        tb2.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        NUM_OF_CELLS = len(tb2.cells)
        tb2.execute_cell(list(range(0, NUM_OF_CELLS - 3)))
        top_k = tb2.ref("top_k")
        outputs = tb2.ref("outputs")


>           assert outputs[0] == "ordered_ids"


E           AssertionError: assert 'item_category' == 'ordered_ids'
E             - ordered_ids
E             + item_category
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py:56: AssertionError
----------------------------- Captured stderr call -----------------------------
2022-07-21 14:10:33.879545: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-21 14:10:36.624048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory:  -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-21 14:10:36.624959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14532 MB memory:  -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python3.8/logging/__init__.py", line 2127, in shutdown
h.close()
File "/usr/local/lib/python3.8/dist-packages/absl/logging/__init__.py", line 934, in close
self.stream.close()
File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 438, in close
self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
WARNING clustering 455 points to 32 centroids: please provide at least 1248 training points
2022-07-21 14:11:58.278275: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-21 14:12:00.261309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory:  -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-21 14:12:00.262039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory:  -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
=========================== short test summary info ============================
FAILED tests/unit/examples/test_building_deploying_multi_stage_RecSys.py::test_func
=================== 1 failed, 1 passed in 105.87s (0:01:45) ====================
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins18299782403772605274.sh

Jul 21 '22 14:07 nvidia-merlin-bot

Related to https://github.com/NVIDIA-Merlin/Merlin/issues/458

Jul 25 '22 16:07 viswa-nvidia

It's hard to review a notebook, but in your cell with

top_k=10
ordering = combined_features["item_id"] >> SoftmaxSampling(
    relevance_col=ranking["click/binary_classification_task"], topk=top_k, temperature=20.0
)

You can replace item_id with item_id_raw, since it's already in your FeatureView. This produces the final ordered list of the "raw" item ids directly and removes the need for the final QueryFeast step:

top_k=10
ordering = combined_features["item_id_raw"] >> SoftmaxSampling(
    relevance_col=ranking["click/binary_classification_task"], topk=top_k, temperature=20.0
)
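
To make the reasoning concrete, here is a minimal sketch in plain Python. It uses a deterministic top-k sort as a stand-in for SoftmaxSampling (which samples rather than sorts), so it does not reproduce Merlin's operator; the helper name, candidate rows, and scores are hypothetical. It only shows that ordering the raw id column gives the same result as ordering encoded ids and then looking the raw ids up afterwards.

```python
def top_k_by_score(values, scores, top_k):
    """Return the top_k values ordered by descending score (deterministic stand-in)."""
    ranked = sorted(zip(scores, values), key=lambda pair: pair[0], reverse=True)
    return [value for _, value in ranked[:top_k]]


# Hypothetical candidates that already carry both the encoded and the raw id.
candidates = [
    {"item_id": 1, "item_id_raw": "sku-9"},
    {"item_id": 2, "item_id_raw": "sku-3"},
    {"item_id": 3, "item_id_raw": "sku-7"},
]
scores = [0.2, 0.9, 0.5]  # hypothetical ranking-model outputs, one per candidate

# Route A: order the encoded ids, then do a second lookup to recover raw ids.
encoded_order = top_k_by_score([c["item_id"] for c in candidates], scores, top_k=2)
raw_by_encoded = {c["item_id"]: c["item_id_raw"] for c in candidates}
via_lookup = [raw_by_encoded[encoded] for encoded in encoded_order]

# Route B: order the raw id column directly; no second lookup is needed.
direct = top_k_by_score([c["item_id_raw"] for c in candidates], scores, top_k=2)

print(via_lookup, direct)
```

Both routes select the same items because the ordering depends only on the scores, not on which id column is carried along.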

Jul 28 '22 20:07 nv-alaiacano

@nv-alaiacano thanks for the comment. I am actually going to close this PR and create a new one. I have already modified the example notebooks around this PR and used ordering = combined_features["item_id_raw"], as you and @karlhigley pointed out. The issue is that the unit test fails on CI. You can check the CI error I get in this PR as well. https://github.com/NVIDIA-Merlin/Merlin/pull/487

Jul 28 '22 22:07 rnyak

>       assert outputs[0] == "ordered_ids"
E       AssertionError: assert 'item_category' == 'ordered_ids'
E         - ordered_ids
E         + item_category

Make sure you're running against the latest main branch of systems. I made a change in systems to not rename the output column to ordered_ids, and instead keep the name of the column that is actually being ordered (in this case, item_category). It ultimately got reverted for unrelated reasons, but it seems like you might be running a test against a version of systems while it was in there.
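
One quick way to see which build the test environment actually picked up is to print the installed version before running the notebooks. This is a small sketch using only the standard library; the distribution name `merlin-systems` is an assumption about how the package is installed, and a version string alone may not identify a specific commit on main.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "merlin-systems" is the assumed distribution name; adjust it if yours differs.
print("merlin-systems:", installed_version("merlin-systems"))
```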

Aug 22 '22 19:08 nv-alaiacano

Closing this, since we have another PR: https://github.com/NVIDIA-Merlin/Merlin/pull/618

Sep 13 '22 21:09 rnyak