Statistics and data splitting scheme on PPI datasets

Open • sduzxj opened this issue 2 years ago • 2 comments

Using the following code, I get statistics that differ from those in the documentation [1, 2]. How can I get the preprocessed dataset, or the code for the dataset splitting scheme used in [2]?

Code:

    import torch
    import torchdrug
    from torchdrug import core, datasets, tasks, models, layers
    from torchdrug.datasets import HumanPPI, YeastPPI, PPIAffinity, Fold
    from torchdrug import data, utils
    from torchdrug import transforms as T

    # Load the Yeast PPI dataset lazily and take the three standard splits
    dataset = YeastPPI('./dataset/PPI', lazy=True)
    train_set, valid_set, test_set = dataset.split(['train', 'valid', 'test'])

    # Report the number of samples in each split
    print(len(train_set))
    print(len(valid_set))
    print(len(test_set))

Output statistics: 2421, 203, 326

[1] https://torchdrug.ai/docs/api/datasets.html
[2] https://torchprotein.ai/benchmark#leaderboard-for-yeast-ppi-prediction

sduzxj, Nov 09 '23 06:11
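When comparing split sizes against documented statistics, it can help to look at counts and fractions side by side. The following is a minimal sketch, not part of torchdrug: `summarize_splits` is a hypothetical helper that accepts any objects supporting `len()`, so the `train_set`, `valid_set`, and `test_set` from the snippet above could be passed in directly. The `range` objects below are stand-ins sized to the numbers reported in this post.

```python
from typing import Dict, Sized


def summarize_splits(splits: Dict[str, Sized]) -> Dict[str, dict]:
    """Return the sample count and fraction of the total for each named split.

    Hypothetical helper for illustration; it only relies on len().
    """
    total = sum(len(subset) for subset in splits.values())
    summary = {}
    for name, subset in splits.items():
        count = len(subset)
        # Guard against an empty dataset to avoid division by zero
        fraction = count / total if total else 0.0
        summary[name] = {"count": count, "fraction": fraction}
    return summary


# Stand-ins for the real splits, sized to the statistics reported above
example_splits = {
    "train": range(2421),
    "valid": range(203),
    "test": range(326),
}

for name, stats in summarize_splits(example_splits).items():
    print(f"{name}: {stats['count']} samples ({stats['fraction']:.1%})")
```

With the real dataset, the call would look like `summarize_splits({"train": train_set, "valid": valid_set, "test": test_set})`.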

I have the same issue with the subcellular localization dataset. The numbers of samples in the training, validation, and test sets differ from those reported in the PEER paper. What is the problem? Is there an additional post-processing step we need to apply to the dataset?

mahdip72, Nov 13 '23 06:11

@sduzxj

I ran your code and got different numbers: 4945, 95, 394

mahdip72, Nov 20 '23 07:11
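Since the same code produced different split sizes on two machines, one thing worth ruling out is that the downloaded or cached files differ. The thread does not confirm this as the cause; the sketch below is only one possible check. It uses the Python standard library to fingerprint every file under a dataset directory so that two users can compare results. The function names are hypothetical, and `./dataset/PPI` is simply the path used in the original snippet.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_directory(root):
    """Map each file under root (by relative path) to its MD5 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


if __name__ == "__main__":
    dataset_dir = Path("./dataset/PPI")  # path taken from the snippet above
    if dataset_dir.is_dir():
        for name, checksum in fingerprint_directory(dataset_dir).items():
            print(f"{checksum}  {name}")
    else:
        print(f"{dataset_dir} does not exist; nothing to fingerprint")
```

If the digests match on both machines, the difference more likely comes from the installed library version or from how the splits are derived, which would need to be checked separately.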