
[BUG] Running EASE model with 600k ratings crashes with out of memory error

Open filippo-orru opened this issue 1 year ago • 9 comments

Description

I've been trying out various cornac models and had some success. However, running the EASE model on a training set with ~600k ratings never succeeds. Memory consumption starts at around 500MB, then quickly grows to ~60GB until, at some point, the process is killed. (screenshot: Activity Monitor, 2024-12-25, showing memory growth)

In which platform does it happen?

macOS 15.2 running on an M1 Pro with 16GB of memory.

How do we replicate the issue?

Minimal example:

import cornac
from cornac.eval_methods import RatioSplit
from cornac.models import EASE
from cornac.metrics import NDCG
import pandas as pd

path = "training_data/training_data_ratings_20241218_224839.parquet.snappy"
df_original = pd.read_parquet(path)
print("Loaded data")

# Convert dataframe to list of tuples (user_id, item_id, rating)
data = df_original[["userID", "itemID", "rating"]].values.tolist()
rs = RatioSplit(data, test_size=0.15, val_size=0.1, rating_threshold=3.0)
print(f"{len(data)} ratings: {rs.train_size} training and {rs.test_size} test")

ndcg = NDCG(k=10)
metrics = [ndcg]

ease = EASE()
models = [ease]

cornac.Experiment(eval_method=rs, models=models, metrics=metrics, user_based=True, verbose=True).run()

print("Done!")

ease.recommend("my_user_id", k=10) # Never reaches this point

Expected behavior (i.e. solution)

The experiment should run successfully and output the results. 600k training samples isn't that much, and the model is extremely simple. I don't see how it would need this much memory.

filippo-orru avatar Dec 25 '24 17:12 filippo-orru

Apparently the current implementation of the EASE model is not very efficient with a large number of items. Could you share the number of items in your data? I would also suggest filtering out items with at least X ratings to start with.

qtuantruong avatar Dec 27 '24 18:12 qtuantruong

My data is quite skewed. Few users, each with lots of ratings: 'users': 73, 'items': 445049, 'interactions': 634437, 'interaction density': '1.95%'

I've read that EASE is not ideal for large datasets, so I tried using SANSA instead. They describe their model as "sparse EASE for millions of items". In an example Jupyter notebook, they train SANSA on the Amazon Books dataset: 'users': 52643, 'items': 91599, 'interactions': 2984108, 'interaction density': '0.0619%'. I was able to successfully train the model using only a small part of Colab's 16 GB of memory. I'm still trying to adapt the code to train on my own dataset.

Using cornac's RatioSplit as a dataloader (rs.train_set.csr_matrix) to train SANSA causes the same kind of growing memory consumption, so maybe part of the issue lies there?

filippo-orru avatar Dec 28 '24 09:12 filippo-orru

I would also suggest filtering out items with at least X ratings to start with.

Do you mean "don't include items with more than X ratings in the training set"?

filippo-orru avatar Dec 28 '24 09:12 filippo-orru

You mentioned that you've successfully run other models in Cornac, so it can't be a RatioSplit or train_set problem. We use a sparse representation for the rating matrix, so Cornac is similar to SANSA in terms of data handling. The problem is that the EASE implementation in Cornac transforms an intermediate matrix of size item x item into a dense matrix, which causes the memory overhead. This matches the original implementation.
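To see why the dense intermediate matrix is the bottleneck, here is a back-of-the-envelope estimate (not Cornac code; the item count is taken from the dataset stats reported below in this thread):

```python
# Rough memory cost of materializing a dense item x item float64 matrix,
# as the EASE fit step does. Item count from the issue author's dataset.
n_items = 445_049
bytes_per_float64 = 8
dense_bytes = n_items ** 2 * bytes_per_float64
print(f"{dense_bytes / 1e12:.2f} TB")  # ~1.58 TB, far beyond 16 GB of RAM
```

With ~445k items, even allocating the dense matrix is hopeless on a 16 GB machine, regardless of how sparse the ratings themselves are.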

The point is that if you want to use an EASE-style model on your data, your best shot is to stick with SANSA. Or you're welcome to contribute an implementation of SANSA to Cornac so others can benefit from our ecosystem.

qtuantruong avatar Dec 28 '24 18:12 qtuantruong

I would also suggest filtering out items with at least X ratings to start with.

Do you mean "don't include items with more than X ratings in the training set"?

I mean you can try to reduce the size of your data by removing long-tail items. Not sure if that's what you want, just a suggestion in case they're not that important for your problem.

qtuantruong avatar Dec 28 '24 18:12 qtuantruong

@tqtg I'd be happy to contribute the implementation of SANSA to your library :). Since it's basically EASE, it more or less suffices to copy its interface and replace the fit function with the core of SANSA, which I would add as an additional dependency.

matospiso avatar Jan 08 '25 17:01 matospiso

@filippo-orru

My data is quite skewed. Few users, each with lots of ratings: 'users': 73, 'items': 445049, 'interactions': 634437, 'interaction density': '1.95%'

The problem here is caused by the density of rows in your user-item matrix X. SANSA, like EASE, at one point computes the Gram matrix X.T @ X, and this matrix will have at least max_user_nonzeros ** 2 nonzero entries (you can see this if you draw the outer product of the user row with itself). My guess is you have at least one user with tens of thousands of interactions, maybe hundreds of thousands, and so you run out of memory.

Possible solutions:

  • delete users with >= c * 10000 interactions (bots maybe?)
  • split your training users into shorter (possibly overlapping) sessions

The goal is to reduce the density of X.T @ X so that it fits in memory, then SANSA will work :)
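The effect described above can be demonstrated with a small sketch (toy data, not the actual dataset): a single user with k interactions contributes a dense k x k block of nonzeros to the Gram matrix X.T @ X.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_users, n_items = 5, 1000

# One "heavy" user with 500 interactions, four light users with 10 each.
rows, cols = [], []
for u, k in enumerate([500, 10, 10, 10, 10]):
    rows += [u] * k
    cols += list(rng.choice(n_items, size=k, replace=False))
X = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_users, n_items))

G = (X.T @ X).tocsr()
# The heavy user alone contributes 500**2 = 250k nonzero entries to G,
# even though X itself has only 540 nonzeros.
print(X.nnz, G.nnz)
```

Scaling this up, a real user with 26k interactions contributes 26k² ≈ 676 million nonzeros on their own, which is why capping or splitting heavy users tames the Gram matrix.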

matospiso avatar Jan 08 '25 18:01 matospiso

My guess is you have at least one user with tens of thousands of interactions, maybe hundreds of thousands, and so you run out of memory.

You're right, but I'm pretty sure it's not a bot ^^


Thanks for the explanation, it makes the issue much clearer. I don't want to remove interactions by heavy users because I think they're quite valuable.

Here's a plot of the rating distribution for users (max user ratings: 26k). (plot: user rating distribution)

For now, I solved the issue by dropping items with fewer than 15 interactions. But of course that makes it very hard for niche items to be discovered; the cold-start problem gets even worse than it already is.
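For reference, the item-dropping step can be sketched in pandas like this (toy dataframe with the same column names as the repro script; the threshold here is 2 rather than 15 just to keep the example small):

```python
import pandas as pd

# Toy interaction data with the same columns as the repro script.
df = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 3],
    "itemID": ["a", "b", "a", "c", "a", "c", "d"],
    "rating": [5, 4, 3, 5, 2, 4, 1],
})

min_item_ratings = 2  # the real threshold used in this thread was 15
item_counts = df["itemID"].value_counts()
keep = item_counts[item_counts >= min_item_ratings].index
df_filtered = df[df["itemID"].isin(keep)]
print(len(df_filtered))  # 5: items "b" and "d" are dropped
```

The filtered frame can then be converted to tuples and fed to RatioSplit exactly as in the original script.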

Here is the rating distribution for items after dropping (max item ratings: 334). (plot: item rating distribution after dropping)

Do you think the splitting solution would be better? E.g. "split users with >5k ratings into batches of 2k ratings +1k randomly sampled overlapping ratings".

filippo-orru avatar Jan 08 '25 18:01 filippo-orru

Yes, I think splitting the users can work well. It really depends on your data and domain, but if you collected this data over a longer period of time, it might make sense to divide interactions of each user into overlapping 24-hour windows (or 7-day, 30-day, etc.).

  • Without splitting, every pair of co-interacted items (by some user) is considered connected and the model will learn this relationship
  • With splitting, items have to be co-interacted within the same window, otherwise the model ignores the connection. This biases the model to focus more on short-term trends in user behavior instead of treating the entire history as a single chunk, and in many cases this is a good kind of bias to introduce (if I watch 2 videos on YouTube in one evening, those videos are much more likely to be related than a pair of videos watched a week apart) :)

matospiso avatar Jan 08 '25 19:01 matospiso