Consultation on the dataset preprocessing part

Open Cinderella1001 opened this issue 3 years ago • 1 comments

How were the files item_vector.npy (shape (38342, 150)) and user_vector.npy (shape (17237, 150)) obtained? I look forward to your reply and would be very grateful for any details.

Aug 12 '22 01:08 Cinderella1001

Hi, thank you for your interest in our work. The item vector matrix and the user vector matrix are generated in three steps. The details are below; I hope they help:

1. Use the gensim toolkit to train a word2vec model (vector dimension 150, 10 training epochs, all other parameters left at their defaults) with the Yelp review text as training data. This gives you a vocabulary embedding matrix.
2. Represent each Yelp review by average pooling the embeddings of its words. This gives you one representation per review, which also has dimension 150.
3. Finally, for each user or item, collect the corresponding reviews, then represent that user or item by average pooling the representations of those reviews.
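
The three steps above can be sketched roughly as follows. This is only an illustration under several assumptions, not the exact script used for the released files: the function names are hypothetical, reviews are assumed to be already tokenized, user and item ids are assumed to be contiguous integers starting at 0, and a review with no in-vocabulary words is assumed to map to a zero vector. The keyword names `vector_size` and `epochs` are those of gensim 4.x (older releases call them `size` and `iter`).

```python
import numpy as np

DIM = 150  # embedding dimension stated in the reply


def train_word2vec(tokenized_reviews):
    """Step 1: train word2vec on the review text (requires gensim >= 4.0)."""
    from gensim.models import Word2Vec  # imported lazily, only needed here
    model = Word2Vec(sentences=tokenized_reviews, vector_size=DIM, epochs=10)
    return model.wv  # supports `word in wv` and `wv[word]`


def review_vector(tokens, wv, dim=DIM):
    """Step 2: average the embeddings of the in-vocabulary words of a review."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        # Assumption: a review with no known words becomes a zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0)


def pool_by_key(keys, review_vecs, num_keys, dim=DIM):
    """Step 3: average the review vectors that share a user id or item id."""
    sums = np.zeros((num_keys, dim), dtype=np.float64)
    counts = np.zeros(num_keys, dtype=np.int64)
    for k, v in zip(keys, review_vecs):
        sums[k] += v
        counts[k] += 1
    counts = np.maximum(counts, 1)  # ids without reviews keep a zero row
    return (sums / counts[:, None]).astype(np.float32)
```

With one user id and one item id per review, calling `pool_by_key` once with the user ids and once with the item ids yields the two matrices, which can then be written out with `np.save`. The shapes reported in the question would correspond to `num_keys=17237` for users and `num_keys=38342` for items.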

Aug 12 '22 02:08 PeiJieSun