
Add Model Ensembling Tutorial

Open darrylong opened this issue 1 year ago • 5 comments

Description

In this PR, a model ensembling tutorial is added. The tutorial uses scikit-learn to perform ensembling on top of models trained with Cornac.

Related Issues

Checklist:

  • [ ] I have added tests.
  • [ ] I have updated the documentation accordingly.
  • [ ] I have updated README.md (if you are adding a new model).
  • [ ] I have updated examples/README.md (if you are adding a new example).
  • [ ] I have updated datasets/README.md (if you are adding a new dataset).

darrylong avatar Jul 24 '24 09:07 darrylong

Looking to receive feedback on the implementation and structure of this tutorial.

Let me know how we can improve this. Thanks!

darrylong avatar Jul 25 '24 04:07 darrylong

Thanks Darryl for the tutorial. My first suggestion is that we can start simple, without needing to train any additional model, perhaps based on a simple voting mechanism to identify a top-K ranked list from two models (BPR and WMF). We can then use that as a baseline for more sophisticated ensembling techniques.
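A minimal sketch of the voting idea, assuming we already have top-K lists from the two models (the item IDs below are made up for illustration; in practice they would come from trained BPR and WMF models):

```python
from collections import Counter

K = 5
bpr_top_k = [3, 7, 1, 9, 4]   # hypothetical top-K from a trained BPR model
wmf_top_k = [7, 3, 2, 4, 8]   # hypothetical top-K from a trained WMF model

# Each appearance in a model's top-K counts as one vote.
votes = Counter(bpr_top_k + wmf_top_k)

def best_position(item):
    # Tie-break by the best (lowest) position the item achieved in either list.
    return min(lst.index(item) for lst in (bpr_top_k, wmf_top_k) if item in lst)

ensemble_top_k = sorted(votes, key=lambda i: (-votes[i], best_position(i)))[:K]
```

Items recommended by both models rise to the top; the tie-break keeps the result deterministic.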

qtuantruong avatar Jul 25 '24 17:07 qtuantruong

Simplest bagging approach could be as follows:

  1. Train M recommender models (base models) with bootstrapped samples of the training set (we don't strictly need different samples; training with different random seeds mimics the same idea, and it also ensures the base models share the same set of users and items).
  2. For rating prediction, generate M rating predictions with the base models and combine them per item (e.g., average/sum; this can be a weighted sum if we have model preferences, i.e., prefer some models over others).
  3. For ranking prediction, generate M top-K recommendation lists with the base models and combine the lists (e.g., by vote count).
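Step 2 (combining rating predictions) can be sketched as follows. The scores and model names below are made up; in Cornac they would come from each base model's `score()` method:

```python
# Hypothetical per-item scores from M = 3 base models.
scores = {
    "model_1": {"item_a": 4.0, "item_b": 3.0, "item_c": 5.0},
    "model_2": {"item_a": 3.5, "item_b": 4.5, "item_c": 4.0},
    "model_3": {"item_a": 4.5, "item_b": 2.5, "item_c": 4.5},
}
# Uniform weights = plain average; adjust to prefer some models over others.
weights = {"model_1": 1.0, "model_2": 1.0, "model_3": 1.0}

items = scores["model_1"].keys()
combined = {
    item: sum(weights[m] * scores[m][item] for m in scores) / sum(weights.values())
    for item in items
}
```

The combined score per item is the weighted mean of the base models' predictions.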

qtuantruong avatar Jul 25 '24 17:07 qtuantruong

For a more sophisticated approach, think of this as a meta-learning problem: we treat the predictions of the M base models as input features for another meta-model to learn on top of. This meta-model could be any ML model, e.g., linear regression, random forests, etc. We can structure this part to be flexible so that anyone can experiment with other libraries (e.g., scikit-learn, lightgbm, xgboost).
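A small sketch of the stacking idea, fitting a linear meta-model with NumPy's least squares for self-containment (scikit-learn's `LinearRegression` would typically be used instead; all numbers below are synthetic, not from any trained Cornac model):

```python
import numpy as np

# Columns = predictions from M = 2 hypothetical base models for the same
# user-item pairs; in practice these would be scores from trained Cornac models.
base_preds = np.array([
    [4.0, 3.8],
    [2.0, 2.5],
    [5.0, 4.6],
    [3.0, 3.2],
])
true_ratings = np.array([4.0, 2.2, 4.9, 3.1])

# Fit a linear meta-model (with intercept) via least squares.
X = np.hstack([base_preds, np.ones((len(base_preds), 1))])
coef, *_ = np.linalg.lstsq(X, true_ratings, rcond=None)

# Base-model scores for a new user-item pair -> ensemble prediction.
new_preds = np.array([[3.5, 3.6]])
meta_pred = np.hstack([new_preds, [[1.0]]]) @ coef
```

Swapping the least-squares fit for a scikit-learn, lightgbm, or xgboost regressor changes only the fit/predict lines, which is what makes this part easy to keep library-agnostic.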

qtuantruong avatar Jul 25 '24 17:07 qtuantruong

Thanks Darryl. This looks great!

Here are some comments:

  • The model.rank() call should be able to filter the top_k via the k arg, so we don't need to do it manually.
  • For Borda count, let's use the same language that we use in the example, i.e., simplify the tables by showing only the Allocated Points (N - rank), not Rank and Inverse Rank.
  • For Section 3, combining multiple WMF models using the Borda count method, let's not use inverse_rank anymore because it's difficult to understand; it's only valuable for explaining Borda count. At this point, let's assume that everyone understands the method, so we just show the top-k recommendations and compare across the individual models and the ensemble.
  • Let's remove this explanation: "Meta-learning, also called 'learning to learn', is a method to teach models to learn and adapt to new tasks." because it's not what we're doing here.
  • In Section 4.1 Prepare Data, can we show both X_train and y_train in the same table?
  • [IMPORTANT] test_df for linear regression (or any other ML model) should be the full user-item matrix (not the test set only). The idea is that if we want to give recommendations for a user, we need to predict scores for all items for that user in order to rank them, not just the items that appear in the test set of the WMF models. If the full user-item matrix is too big, we can illustrate how to give recommendations for one user, though we still need to predict for all items.
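The Allocated Points (N - rank) framing from the Borda count comments above can be sketched like this (rankings are hypothetical, with rank counted from 1):

```python
# Hypothetical top-N rankings from two models; rank 1 is best.
N = 4
model_a = ["item_w", "item_x", "item_y", "item_z"]
model_b = ["item_x", "item_w", "item_z", "item_y"]

# Borda count: each item receives N - rank points per ranking.
points = {}
for ranking in (model_a, model_b):
    for rank, item in enumerate(ranking, start=1):
        points[item] = points.get(item, 0) + (N - rank)

# Higher total points = better ensemble rank.
ensemble = sorted(points, key=points.get, reverse=True)
```

Showing only these totals keeps the tables simpler than carrying separate Rank and Inverse Rank columns.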

qtuantruong avatar Aug 20 '24 13:08 qtuantruong