[BUG] Running EASE model with 600k ratings crashes with out of memory error
Description
I've been trying out various cornac models and had some success. However, running the EASE model on a training set with ~600k ratings never succeeds. Memory consumption starts at around 500MB, then quickly grows to ~60GB until, at some point, the process is killed.
In which platform does it happen?
macOS 15.2 running on an M1 Pro with 16GB of memory.
How do we replicate the issue?
Minimal example:

```python
import pandas as pd

import cornac
from cornac.eval_methods import RatioSplit
from cornac.metrics import NDCG
from cornac.models import EASE

path = "training_data/training_data_ratings_20241218_224839.parquet.snappy"
df_original = pd.read_parquet(path)
print("Loaded data")

# Convert dataframe to a list of (user_id, item_id, rating) tuples
data = df_original[["userID", "itemID", "rating"]].values.tolist()

rs = RatioSplit(data, test_size=0.15, val_size=0.1, rating_threshold=3.0)
print(f"{len(data)} ratings: {rs.train_size} training and {rs.test_size} test")

ndcg = NDCG(k=10)
metrics = [ndcg]

ease = EASE()
models = [ease]

cornac.Experiment(eval_method=rs, models=models, metrics=metrics, user_based=True, verbose=True).run()
print("Done!")

ease.recommend("my_user_id", k=10)  # Never reaches this point
```
Expected behavior (i.e. solution)
The experiment should run successfully and output the results. 600k training samples isn't that much, and the model is extremely simple. I don't see how it would need this much memory.
Apparently the current implementation of the EASE model is not very efficient with a large number of items. Could you share the number of items in your data? I would also suggest filtering out items with at least X ratings to start with.
My data is quite skewed. Few users, each with lots of ratings:
'users': 73, 'items': 445049, 'interactions': 634437, 'interaction density': '1.95%'
I've read that EASE is not ideal for large datasets, so I tried using SANSA instead. They describe their model as "sparse EASE for millions of items".
In this example Jupyter notebook, they train SANSA on the Amazon Books dataset:
'users': 52643, 'items': 91599, 'interactions': 2984108, 'interaction density': '0.0619%'.
I was able to successfully train the model using only a small part of Colab's 16 GB of memory.
I'm still trying to adapt the code to train on my own dataset.
Using cornac's RatioSplit as a dataloader (rs.train_set.csr_matrix) to train SANSA causes the same kind of growing memory consumption, so maybe part of the issue lies there?
I would also suggest filtering out items with at least X ratings to start with.
Do you mean "don't include items with more than X ratings in the training set"?
You mentioned that you've successfully run other models in Cornac, so it can't be a problem with RatioSplit or the train_set. We use a sparse representation for the rating matrix, so the data handling is similar to SANSA's. The problem is that Cornac's EASE implementation converts an intermediate item × item matrix into a dense matrix, which causes the memory overhead. This matches the original implementation.
The point is, if you want to use the EASE model on your data, your best shot is to stick with SANSA. Or you're welcome to contribute an implementation of SANSA to Cornac so others can benefit from our ecosystem.
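A quick back-of-envelope sketch of why that dense intermediate can never fit in memory here, using the ~445k item count posted above:

```python
n_items = 445_049  # item count reported earlier in the thread

# A dense item x item matrix of float64 costs 8 bytes per entry.
dense_bytes = n_items ** 2 * 8
print(f"{dense_bytes / 1e12:.2f} TB")  # ~1.58 TB, so the process is killed long before the matrix is filled
```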
I would also suggest filtering out items with at least X ratings to start with.
Do you mean "don't include items with more than X ratings in the training set"?
I mean you can try to reduce the size of your data by removing long-tail items. Not sure if that's what you want, just a suggestion in case they're not that important for your problem.
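For reference, a minimal pandas sketch of that long-tail filtering, reusing the userID/itemID/rating column names from the repro snippet above (the toy data and the threshold are arbitrary):

```python
import pandas as pd

# Toy ratings frame; in this thread the real columns are userID / itemID / rating.
df = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3, 3, 3],
    "itemID": ["a", "b", "a", "c", "a", "b", "d"],
    "rating": [5, 4, 3, 5, 4, 2, 1],
})

min_ratings = 2  # keep only items rated at least this many times
counts = df["itemID"].value_counts()
popular = counts[counts >= min_ratings].index
df_filtered = df[df["itemID"].isin(popular)]

print(df_filtered["itemID"].unique())  # ['a' 'b'] survive; long-tail items 'c' and 'd' are dropped
```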
@tqtg I'd be happy to contribute the implementation of SANSA to your library :). Since it's basically EASE, it more or less suffices to copy its interface and replace the fit function with the core of SANSA, which I would add as an additional dependency.
@filippo-orru
My data is quite skewed. Few users, each with lots of ratings:
'users': 73, 'items': 445049, 'interactions': 634437, 'interaction density': '1.95%'
The problem here is caused by the density of rows in your user-item matrix X. SANSA, like EASE, at one point computes the Gram matrix X.T @ X, and this matrix will have at least max_user_nonzeros ** 2 nonzero entries (you can see this if you draw the outer product of the user row with itself). My guess is you have at least one user with tens of thousands of interactions, maybe hundreds of thousands, and so you run out of memory.
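A small synthetic illustration of this effect (made-up sizes, not the actual dataset): a single user with 2,000 interactions already dominates the nonzero count of the Gram matrix, because their row's outer product with itself fills a dense 2,000 × 2,000 block.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_users, n_items = 50, 10_000

# One heavy user with 2,000 interactions; the remaining 49 users have 20 each.
rows, cols = [], []
heavy_items = rng.choice(n_items, size=2_000, replace=False)
rows += [0] * len(heavy_items)
cols += list(heavy_items)
for u in range(1, n_users):
    rows += [u] * 20
    cols += list(rng.choice(n_items, size=20, replace=False))

X = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_users, n_items))
gram = (X.T @ X).tocsr()

print(X.nnz)     # 2,980 nonzeros in X itself
print(gram.nnz)  # >= 2,000**2 = 4,000,000 nonzeros, dominated by the heavy user alone
```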
Possible solutions:
- delete users with `>= c * 10000` interactions (bots maybe?)
- split your training users into shorter (possibly overlapping) sessions
The goal is to reduce the density of X.T @ X so that it fits in memory, then SANSA will work :)
My guess is you have at least one user with tens of thousands of interactions, maybe hundreds of thousands, and so you run out of memory.
You're right, but I'm pretty sure it's not a bot ^^
Thanks for the explanation, it makes the issue much clearer. I don't want to remove interactions by heavy users because I think they're quite valuable.
Here's a plot of the rating distribution for users (max user ratings: 26k)
For now, I solved the issue by dropping items with fewer than 15 interactions. But of course that makes it very hard for niche items to be discovered; the cold-start problem becomes even worse than it already is.
Here is the rating distribution for items after dropping (max item ratings: 334)
Do you think the splitting solution would be better? E.g. "split users with >5k ratings into batches of 2k ratings +1k randomly sampled overlapping ratings".
Yes, I think splitting the users can work well. It really depends on your data and domain, but if you collected this data over a longer period of time, it might make sense to divide interactions of each user into overlapping 24-hour windows (or 7-day, 30-day, etc.).
- Without splitting, every pair of co-interacted items (by some user) is considered connected and the model will learn this relationship
- With splitting, items have to be co-interacted in the same window, otherwise the model will ignore this connection. This biases the model to focus more on short-term trends of user behavior instead of considering their entire history as a single chunk - and in many cases this is a good kind of bias to introduce (if I watch 2 videos on YouTube in one evening, those videos are much more likely to be related than a pair of videos watched a week apart) :)
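A minimal sketch of the window-based splitting described above, assuming the data has a timestamp column (the column names and toy data here are assumptions, not taken from the actual dataset): each (user, 24-hour window) pair becomes a pseudo-user for training.

```python
import pandas as pd

# Toy interactions with timestamps; column names are assumptions.
df = pd.DataFrame({
    "userID": [1, 1, 1, 2],
    "itemID": ["a", "b", "c", "a"],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 12:00",  # same day -> same session
        "2024-01-05 09:00",                      # days later -> new session
        "2024-01-01 11:00",
    ]),
})

# Bucket each interaction into a 24-hour window, then treat
# (user, window) as a new pseudo-user when building the training matrix.
window = df["timestamp"].dt.floor("24h")
df["session_user"] = df["userID"].astype(str) + "_" + window.astype(str)

print(df["session_user"].nunique())  # 3 pseudo-users instead of 2 real users
```

Items co-interacted within the same window stay connected in the Gram matrix, while pairs from distant windows no longer contribute, which is exactly what shrinks its density.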