Raw id notebook Conversion how to

Open jperez999 opened this issue 3 years ago • 7 comments

In this PR, I show how it is possible to store the raw id for both items and users in the feature store and use those ids where necessary within the pipeline.
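As a rough illustration of the idea (not the notebook's actual code), the sketch below keeps a raw id next to each encoded id and converts encoded ids back to raw ids on request. The rows, the id values, and the helper names `build_raw_id_lookup` and `to_raw_ids` are all made up for this example; in the notebook the rows would live in the feature store.

```python
# Hypothetical item rows; the values are invented for illustration only.
ITEM_FEATURES = [
    {"item_id": 1, "item_id_raw": 48213, "item_category": 7},
    {"item_id": 2, "item_id_raw": 9051, "item_category": 3},
    {"item_id": 3, "item_id_raw": 77120, "item_category": 7},
]


def build_raw_id_lookup(rows, encoded_key="item_id", raw_key="item_id_raw"):
    """Map each encoded id to the raw id stored in the same row."""
    return {row[encoded_key]: row[raw_key] for row in rows}


def to_raw_ids(encoded_ids, lookup):
    """Translate encoded ids to raw ids, preserving the input order."""
    return [lookup[encoded_id] for encoded_id in encoded_ids]


lookup = build_raw_id_lookup(ITEM_FEATURES)
raw_ids = to_raw_ids([3, 1], lookup)
print(raw_ids)
```

Because the raw id travels with the other item features, any stage of the pipeline that has the feature row can emit the raw id without a separate mapping table.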

jperez999 avatar Jul 18 '22 20:07 jperez999

Click to view CI Results
GitHub pull request #474 of commit 3bbcca6fd9a613df6b23c9934340d056e04c13d1, no merge conflicts.
Running as SYSTEM
Setting status of 3bbcca6fd9a613df6b23c9934340d056e04c13d1 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/267/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/474/*:refs/remotes/origin/pr/474/* # timeout=10
 > git rev-parse 3bbcca6fd9a613df6b23c9934340d056e04c13d1^{commit} # timeout=10
Checking out Revision 3bbcca6fd9a613df6b23c9934340d056e04c13d1 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3bbcca6fd9a613df6b23c9934340d056e04c13d1 # timeout=10
Commit message: "raw_id setup for e2e"
 > git rev-list --no-walk dd36e3afd92da6d92cdab47fa4fbce43161d1c4b # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins12142910180240625167.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py F [100%]

=================================== FAILURES ===================================
__________________________________ test_func ___________________________________

def test_func():
    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "01-Building-Recommender-Systems-with-Merlin.ipynb",
        execute=False,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["NUM_ROWS"] = "10000"
            os.system("mkdir -p /tmp/examples")
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        tb1.execute()
        assert os.path.isdir("/tmp/examples/dlrm")
        assert os.path.isdir("/tmp/examples/feature_repo")
        assert os.path.isdir("/tmp/examples/query_tower")
        assert os.path.isfile("/tmp/examples/item_embeddings.parquet")
        assert os.path.isfile("/tmp/examples/feature_repo/user_features.py")
        assert os.path.isfile("/tmp/examples/feature_repo/item_features.py")

    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb",
        execute=False,
    ) as tb2:
        tb2.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        NUM_OF_CELLS = len(tb2.cells)
        tb2.execute_cell(list(range(0, NUM_OF_CELLS - 3)))
        top_k = tb2.ref("top_k")
        outputs = tb2.ref("outputs")
        assert outputs[0] == "ordered_ids"

E       AssertionError: assert 'item_category' == 'ordered_ids'
E         - ordered_ids
E         + item_category

tests/unit/examples/test_building_deploying_multi_stage_RecSys.py:56: AssertionError
----------------------------- Captured stderr call -----------------------------
2022-07-18 20:43:56.594199: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-18 20:43:58.563387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-18 20:43:58.564229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/logging/__init__.py", line 2127, in shutdown
    h.close()
  File "/usr/local/lib/python3.8/dist-packages/absl/logging/__init__.py", line 934, in close
    self.stream.close()
  File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 438, in close
    self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
WARNING clustering 436 points to 32 centroids: please provide at least 1248 training points
2022-07-18 20:45:23.893679: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-18 20:45:25.865196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-18 20:45:25.865968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
=========================== short test summary info ============================
FAILED tests/unit/examples/test_building_deploying_multi_stage_RecSys.py::test_func
=================== 1 failed, 1 passed in 106.17s (0:01:46) ====================
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins8488356159428892603.sh

nvidia-merlin-bot avatar Jul 18 '22 20:07 nvidia-merlin-bot

Click to view CI Results
GitHub pull request #474 of commit be6887569b8b662cfc299c284e11c0c1e1e23da6, no merge conflicts.
Running as SYSTEM
Setting status of be6887569b8b662cfc299c284e11c0c1e1e23da6 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/270/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/474/*:refs/remotes/origin/pr/474/* # timeout=10
 > git rev-parse be6887569b8b662cfc299c284e11c0c1e1e23da6^{commit} # timeout=10
Checking out Revision be6887569b8b662cfc299c284e11c0c1e1e23da6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f be6887569b8b662cfc299c284e11c0c1e1e23da6 # timeout=10
Commit message: "Merge branch 'main' into raw-id-nb"
 > git rev-list --no-walk dd36e3afd92da6d92cdab47fa4fbce43161d1c4b # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins6907072761729827191.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py F [100%]

=================================== FAILURES ===================================
__________________________________ test_func ___________________________________

def test_func():
    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "01-Building-Recommender-Systems-with-Merlin.ipynb",
        execute=False,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["NUM_ROWS"] = "10000"
            os.system("mkdir -p /tmp/examples")
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        tb1.execute()
        assert os.path.isdir("/tmp/examples/dlrm")
        assert os.path.isdir("/tmp/examples/feature_repo")
        assert os.path.isdir("/tmp/examples/query_tower")
        assert os.path.isfile("/tmp/examples/item_embeddings.parquet")
        assert os.path.isfile("/tmp/examples/feature_repo/user_features.py")
        assert os.path.isfile("/tmp/examples/feature_repo/item_features.py")

    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb",
        execute=False,
    ) as tb2:
        tb2.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
        NUM_OF_CELLS = len(tb2.cells)
        tb2.execute_cell(list(range(0, NUM_OF_CELLS - 3)))
        top_k = tb2.ref("top_k")
        outputs = tb2.ref("outputs")
        assert outputs[0] == "ordered_ids"

E       AssertionError: assert 'item_category' == 'ordered_ids'
E         - ordered_ids
E         + item_category

tests/unit/examples/test_building_deploying_multi_stage_RecSys.py:56: AssertionError
----------------------------- Captured stderr call -----------------------------
2022-07-21 14:10:33.879545: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-21 14:10:36.624048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-21 14:10:36.624959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14532 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/logging/__init__.py", line 2127, in shutdown
    h.close()
  File "/usr/local/lib/python3.8/dist-packages/absl/logging/__init__.py", line 934, in close
    self.stream.close()
  File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 438, in close
    self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
WARNING clustering 455 points to 32 centroids: please provide at least 1248 training points
2022-07-21 14:11:58.278275: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-21 14:12:00.261309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-21 14:12:00.262039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
=========================== short test summary info ============================
FAILED tests/unit/examples/test_building_deploying_multi_stage_RecSys.py::test_func
=================== 1 failed, 1 passed in 105.87s (0:01:45) ====================
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins18299782403772605274.sh

nvidia-merlin-bot avatar Jul 21 '22 14:07 nvidia-merlin-bot

Related to https://github.com/NVIDIA-Merlin/Merlin/issues/458

viswa-nvidia avatar Jul 25 '22 16:07 viswa-nvidia

It's hard to review a notebook, but in your cell with

top_k=10
ordering = combined_features["item_id"] >> SoftmaxSampling(
    relevance_col=ranking["click/binary_classification_task"], topk=top_k, temperature=20.0
)

You can replace item_id with item_id_raw since it's already in your FeatureView. This will produce the final ordered version of the "raw" item ids and remove the need for the final QueryFeast:

top_k=10
ordering = combined_features["item_id_raw"] >> SoftmaxSampling(
    relevance_col=ranking["click/binary_classification_task"], topk=top_k, temperature=20.0
)
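To see why ordering the raw column directly makes the extra lookup unnecessary, here is a toy sketch in plain Python. It deliberately swaps the sampling step for a deterministic sort by score, so it is not how SoftmaxSampling works internally; the helper `top_k_by_score` and all ids and scores are invented for illustration.

```python
def top_k_by_score(values, scores, top_k):
    """Return the top_k values ordered by descending score (plain sort, no sampling)."""
    ranked = sorted(zip(scores, values), key=lambda pair: pair[0], reverse=True)
    return [value for _, value in ranked[:top_k]]


# Made-up candidate columns that are aligned row by row.
item_id = [1, 2, 3, 4]
item_id_raw = [48213, 9051, 77120, 305]
scores = [0.2, 0.9, 0.1, 0.6]

# Option 1: order the encoded ids, then look up the raw ids afterwards.
ordered_encoded = top_k_by_score(item_id, scores, top_k=3)
raw_lookup = dict(zip(item_id, item_id_raw))
via_lookup = [raw_lookup[encoded] for encoded in ordered_encoded]

# Option 2: order the raw id column directly; no lookup step is needed.
direct = top_k_by_score(item_id_raw, scores, top_k=3)

print(via_lookup == direct)
```

Both options give the same raw ids in the same order because the two columns are aligned with the same scores, which is the property the suggestion relies on.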

nv-alaiacano avatar Jul 28 '22 20:07 nv-alaiacano

@nv-alaiacano thanks for the comment. I am actually going to close this PR and create a new one. I have already modified the example notebooks around this PR and used ordering = combined_features["item_id_raw"], as you and @karlhigley pointed out. The issue is that the unit test fails on CI. You can check the CI error I get in this PR as well: https://github.com/NVIDIA-Merlin/Merlin/pull/487

rnyak avatar Jul 28 '22 22:07 rnyak

       assert outputs[0] == "ordered_ids"

E       AssertionError: assert 'item_category' == 'ordered_ids'
E         - ordered_ids
E         + item_category

Make sure you're running against the latest main branch of systems. I made a change in systems to not rename the output column to ordered_ids, and instead keep the name of the column that is actually being ordered (in this case, item_category). It ultimately got reverted for unrelated reasons, but it seems like you might be running a test against a version of systems while it was in there.
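A quick way to act on this is to check which version of systems the test environment actually has, and, while debugging, to make the output-name check tolerant of both naming behaviours. The sketch below is only a suggestion: the distribution name "merlin-systems" is an assumption about how the package is installed, and `installed_version` and `check_output_name` are hypothetical helpers, not part of the test suite.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def check_output_name(outputs, ordered_column, legacy_name="ordered_ids"):
    """Accept either the legacy output name or the name of the ordered column."""
    accepted = {legacy_name, ordered_column}
    if outputs[0] not in accepted:
        raise AssertionError(
            f"unexpected output column {outputs[0]!r}, expected one of {sorted(accepted)}"
        )
    return outputs[0]


# "merlin-systems" is an assumed distribution name; adjust to your environment.
print("systems version:", installed_version("merlin-systems"))
print(check_output_name(["item_category"], ordered_column="item_category"))
```

Loosening the assertion is only a debugging aid; running the test against the latest main branch of systems, as suggested above, is the actual fix.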

nv-alaiacano avatar Aug 22 '22 19:08 nv-alaiacano

Closing this, since we have another PR: https://github.com/NVIDIA-Merlin/Merlin/pull/618

rnyak avatar Sep 13 '22 21:09 rnyak