Category: A1; Team name: Loris; Dataset: GraphLand Benchmark and Wiki-CS dataset
Checklist
- [x] My pull request has a clear and explanatory title.
- [x] My pull request passes the Linting test.
- [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
- [x] My PR follows PEP8 guidelines. (refer to comment below)
- [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
- [ ] I linked to issues and PRs that are relevant to this PR.
Pull Request: Integration of GraphLand Benchmark Datasets
This pull request integrates the datasets from the GraphLand benchmark into the repository. Specifically, I have added:
- ✅ A dataset class that subclasses torch_geometric.data.InMemoryDataset.
- ✅ A dataloader that implements the AbstractLoader class.
- ✅ A Zenodo download class that fetches the datasets.
- ✅ A dedicated YAML configuration file for each dataset.
GraphLand is a benchmark of 14 graph datasets for node property prediction across a wide range of industrial applications. It allows graph ML models to be evaluated on graphs of different sizes, structures, and feature sets, all in a unified environment. GraphLand also focuses on previously unexplored research questions, such as the extent to which realistic temporal distribution shifts in transductive and inductive settings affect the performance of graph ML models.
In addition, this pull request introduces both a configuration file and a dataloader for the Wiki-CS dataset. In this dataset, nodes represent computer science articles, edges are the hyperlinks between them, and the 10 classes correspond to distinct subfields of computer science.
References:
- Gleb Bazhenov, Oleg Platonov, Liudmila Prokhorenkova (2025). GraphLand: A Landscape of Benchmark Datasets for Graph Machine Learning. arXiv:2409.14500. https://arxiv.org/abs/2409.14500
- Peter Mernyei, Catalina Cangea (2022). Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. https://arxiv.org/abs/2007.02901
Additional context
While implementing this integration, I identified some limitations in the current framework that may require further development:
- Dataset organization: Currently, the framework does not support creating subfolders in datasets/graph/ to better organize YAML configuration files. I suggest adding this functionality to improve maintainability as the number of datasets grows.
- Missing labels (semi-supervised setting): The GraphLand datasets contain nodes with missing labels. TopoBench currently does not support semi-supervised settings. As a workaround, I added a drop_missing_y flag that removes nodes without labels. A more robust solution would be to handle missing labels during split creation.
- Missing values in node features: The datasets include missing values in node features. I implemented the default imputation strategy used in the GraphLand paper (most frequent imputation). However, to avoid data leakage, imputation should ideally be applied after splitting (fit on training, transform on test). This feature is currently not supported in TopoBench.
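To make the last two points concrete, here is a minimal plain-Python sketch of what dropping unlabeled nodes and leakage-free most-frequent imputation could look like. It is not the code in this PR: the function names (`drop_missing_y` as a function, `fit_most_frequent`, `apply_imputation`), the list-based graph representation, and the toy split are all assumptions made for illustration.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing_y(x, y, edges):
    """Remove nodes with a missing label and re-index the surviving edges."""
    keep = [i for i, label in enumerate(y) if not is_missing(label)]
    new_index = {old: new for new, old in enumerate(keep)}
    x_kept = [x[i] for i in keep]
    y_kept = [y[i] for i in keep]
    # An edge survives only if both of its endpoints survive.
    edges_kept = [
        (new_index[u], new_index[v])
        for u, v in edges
        if u in new_index and v in new_index
    ]
    return x_kept, y_kept, edges_kept


def fit_most_frequent(rows):
    """Compute the most frequent observed value of each feature column."""
    fill = []
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if not is_missing(row[col])]
        # Fall back to 0.0 if a column is entirely missing (illustrative choice).
        fill.append(Counter(observed).most_common(1)[0][0] if observed else 0.0)
    return fill


def apply_imputation(rows, fill):
    """Replace missing entries with the fitted per-column values."""
    return [
        [fill[col] if is_missing(v) else v for col, v in enumerate(row)]
        for row in rows
    ]


nan = float("nan")
x = [[1.0, nan], [1.0, 5.0], [nan, 5.0], [2.0, nan]]
y = [0, None, 1, 1]  # node 1 has no label
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

x_kept, y_kept, edges_kept = drop_missing_y(x, y, edges)

# Toy split: first two surviving nodes for training, the rest for testing.
train_rows, test_rows = x_kept[:2], x_kept[2:]

# Fit on the training rows only, then apply to both splits.
fill = fit_most_frequent(train_rows)
train_filled = apply_imputation(train_rows, fill)
test_filled = apply_imputation(test_rows, fill)

print(edges_kept)   # [(1, 2), (0, 2)]
print(test_filled)  # [[2.0, 5.0]]
```

The point of the sketch is the order of operations: the fill values are computed from the training rows alone, so no statistic from the test nodes leaks into the features the model is trained on.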
Hi @Loris697! Did you fill out the required Google Form with the information about your PR? We can't find an entry assigned to your PR.
Thank you!
Hi @gbg141, my apologies. I was sure I had already completed the required Google Form. I’ve filled it out now.