[BUG] Robust 2-stage recommender system pipeline

Open bschifferer opened this issue 3 years ago • 3 comments

Bug description

The unit test for the 2-stage recommender system pipeline is flaky for several reasons:

  • The user_id sent to Triton Inference Server does not exist in FEAST storage.
  • FAISS cannot return k valid candidates for the user query:
      - FAISS always returns k candidates, but pads the result with -1 for candidates it could not find.
      - FEAST cannot process -1 as an ID.
      - The underlying issue is that FAISS does not have enough item vectors to generate an index. Even 256 item_embeddings can result in fewer than 100 candidates.
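
The -1 padding described above could be stripped before the IDs are passed on to the feature store. A minimal sketch in plain Python, assuming the retrieval step yields parallel lists of IDs and scores for a single query; `drop_padding` and `PAD_ID` are hypothetical names, not part of Merlin or FAISS:

```python
from typing import List, Sequence, Tuple

PAD_ID = -1  # value used to fill result slots for which no neighbour was found


def drop_padding(
    ids: Sequence[int], scores: Sequence[float]
) -> Tuple[List[int], List[float]]:
    """Keep only the (id, score) pairs whose id is a real item id."""
    kept = [(i, s) for i, s in zip(ids, scores) if i != PAD_ID]
    return [i for i, _ in kept], [s for _, s in kept]


# Example: k=5 was requested, but only 3 neighbours were found.
# The scores in the padded slots are placeholders and are discarded.
ids, scores = drop_padding([12, 7, 3, -1, -1], [0.9, 0.8, 0.4, 0.0, 0.0])
print(ids)  # [12, 7, 3]
```

After this step the caller may hold fewer than k candidates, so anything downstream has to accept a shorter list.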

Unit test: https://github.com/NVIDIA-Merlin/Merlin/blob/main/tests/unit/examples/test_building_deploying_multi_stage_RecSys.py

Edge cases we should handle without crashing the systems:

  • The user_id is not available in FEAST
  • The user requests a larger topk than the number of items indexed in FAISS (n): topk > n
  • FAISS cannot return k valid candidates, even when topk < n
  • FAISS returns item_ids which are not available in FEAST for further processing
  • Candidate IDs are not available in FEAST
  • The number of candidates is less than the requested topk
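
One possible policy for the cases above, shown only as a sketch and not as agreed behaviour: return fewer than topk candidates (possibly none) instead of raising. Plain dicts stand in for the FEAST user and item stores, and a `retrieve` callable stands in for the FAISS search; every name here is hypothetical.

```python
from typing import Callable, Dict, List

PAD_ID = -1  # padding value for slots where no candidate was found


def candidates_for_user(
    user_id: int,
    topk: int,
    user_features: Dict[int, dict],
    item_features: Dict[int, dict],
    retrieve: Callable[[dict, int], List[int]],
) -> List[int]:
    """Return up to `topk` item ids that have features; do not raise on the edge cases."""
    user_row = user_features.get(user_id)
    if user_row is None or topk <= 0:
        # Unknown user or nonsensical request: empty result.
        # A popularity-based fallback list would be another option.
        return []
    raw_ids = retrieve(user_row, topk)
    # Drop padding and ids that the item store does not know about.
    valid = [i for i in raw_ids if i != PAD_ID and i in item_features]
    return valid[:topk]


# Stand-in data: one known user, three known items.
users = {1: {"age": 30}}
items = {10: {}, 11: {}, 12: {}}


def fake_retrieve(user_row: dict, k: int) -> List[int]:
    found = [10, 99, 11]  # 99 has no features in the item store
    return (found + [PAD_ID] * k)[:k]  # pad or truncate to exactly k entries


print(candidates_for_user(1, 5, users, items, fake_retrieve))  # [10, 11]
print(candidates_for_user(2, 5, users, items, fake_retrieve))  # []
```

Whether an empty or short list is acceptable, or whether a fallback should fill the gap, is a product decision that this sketch leaves open.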

What should be the result in each of the cases?

bschifferer avatar Sep 22 '22 09:09 bschifferer

Thanks @bschifferer. These are all valid points. Can we also add the nulls issue to this list? The integration test fails if there are nulls in the user id and item id columns of a real dataset.

rnyak avatar Sep 22 '22 12:09 rnyak

Changed priority to P1. See https://nvidia.slack.com/archives/C01RP7T89PY/p1663872124879779?thread_ts=1663843219.331779&cid=C01RP7T89PY

viswa-nvidia avatar Sep 22 '22 19:09 viswa-nvidia

Just for context on how we got here:

  • The Merlin 1.0 launch created a need to be able to at least tell a story about how serving would work, so we built the multi-stage example and put exactly enough code behind it to make that notebook usually run but not much else.
  • Session-based work has taken a lot of development bandwidth that could otherwise have been allocated to this stuff. Additionally, there's been a significant lack of clarity around how session-based models would fit into multi-stage recommenders, so that work hasn't overlapped with this as much as it otherwise might have.
  • We've spent large chunks of the past year trying to figure out how to get the pieces of Merlin to work together more smoothly, which has involved a lot of Merlin Core development on the part of the Systems devs.

I agree that this stuff is important though, and we might soon have bandwidth to tackle it, once we get session-based serving for both TF and Torch ironed out. Maybe in the 23.04-23.05 timeline?

karlhigley avatar Mar 22 '23 17:03 karlhigley