Category: A2; Team name: TG; Dataset: Simplicial PPI (HIGH-PPI SHS27k + CORUM)
Checklist
- [x] My pull request has a clear and explanatory title.
- [x] My pull request passes the Linting test.
- [x] I added appropriate unit tests and I made sure the code passes all unit tests.
- [x] My PR follows PEP8 guidelines.
- [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
- [x] I linked to issues and PRs that are relevant to this PR.
- [ ] "Official" HIGH-PPI splits
Description
This PR adds the HIGH-PPI SHS27k + CORUM dataset to TopoBench: a natively higher-order simplicial complex dataset that combines a protein-protein interaction network with experimentally validated protein complexes.
Note: This PR focuses on dataset integration and infrastructure. The training pipeline for higher-order prediction tasks will be added in my B.2 submission.
Dataset Structure:
- 0-cells: 1,553 human proteins
- 1-cells: 6,660 protein-protein interactions with typed edges (7 interaction types: reaction, binding, ptmod, activation, inhibition, catalysis, expression) plus confidence scores
- 2+ cells: ~470 experimentally validated protein complexes from CORUM
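For illustration, here is a minimal sketch of how the 1-cell attributes listed above (7 interaction types plus a confidence score) can be packed into a single 8-dim vector. The helper name `encode_edge`, the slot order of the interaction types, and the multi-hot layout are assumptions for this example, not necessarily what the loader in this PR does.

```python
# Interaction types in the order listed in this PR description.
# Assumption: the real loader may use a different slot order.
INTERACTION_TYPES = (
    "reaction", "binding", "ptmod", "activation",
    "inhibition", "catalysis", "expression",
)


def encode_edge(types, confidence):
    """Return an 8-dim feature: 7 multi-hot type flags, then the confidence."""
    unknown = set(types) - set(INTERACTION_TYPES)
    if unknown:
        raise ValueError(f"unknown interaction types: {sorted(unknown)}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # An edge can carry several interaction types at once, so more than one flag may be set.
    flags = [1.0 if t in types else 0.0 for t in INTERACTION_TYPES]
    return flags + [float(confidence)]


# Hypothetical edge that is both a binding and an activation interaction.
features = encode_edge({"binding", "activation"}, 0.82)
```

Keeping the type flags and the score in one vector means both edge tasks can read their targets from the same attribute.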
Simplicial Complex Construction:
- Add proteins as 0-cells
- Add HIGH-PPI edges as 1-cells with 8-dim feature vectors (7 interaction types + 1 confidence score)
- Process CORUM complexes (top-down, largest first):
  - Add each complex to the simplicial complex (automatically creates all sub-faces via TopoNetX)
  - Mark the complex as positive (+1)
  - Mark all proper sub-faces as negative (-1) unless already labeled by a smaller CORUM complex
  - Boost confidence to 1.0 for edges within CORUM complexes (both HIGH-PPI and CORUM-only edges)
- Generate random negative samples proportionally for higher-order cells (rank ≥2) to balance the dataset
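The labeling and sampling steps above can be sketched in plain Python. This is one reading of the rules, not the code in this PR. It does not call TopoNetX, and the function names, the choice to label only faces with at least two proteins, and the retry cap are all assumptions made for the example.

```python
import random
from itertools import combinations


def label_complexes(complexes, edge_confidence):
    """Label each CORUM complex +1 and its still-unlabeled proper sub-faces -1.

    Also sets the confidence of every edge inside a complex to 1.0.
    Assumption: only faces with at least two proteins receive a label.
    """
    labels = {}
    unique = {frozenset(c) for c in complexes}
    for cx in sorted(unique, key=len, reverse=True):  # largest first
        labels[cx] = 1  # a known complex is positive, even if seen as a sub-face earlier
        for size in range(2, len(cx)):
            for face in combinations(sorted(cx), size):
                labels.setdefault(frozenset(face), -1)  # never overwrite an existing label
        for pair in combinations(sorted(cx), 2):
            edge_confidence[frozenset(pair)] = 1.0
    return labels


def sample_negatives(labels, proteins, neg_ratio=1.0, seed=0, max_tries=10_000):
    """Draw random unlabeled protein sets as extra negatives for rank >= 2."""
    rng = random.Random(seed)
    proteins = sorted(proteins)
    positives_per_size = {}
    for cell, y in labels.items():
        if y == 1 and len(cell) >= 3:  # rank >= 2 means at least 3 proteins
            positives_per_size[len(cell)] = positives_per_size.get(len(cell), 0) + 1
    negatives = {}
    for size, n_pos in positives_per_size.items():
        wanted = round(neg_ratio * n_pos)
        found = set()
        for _ in range(max_tries):
            if len(found) >= wanted:
                break
            candidate = frozenset(rng.sample(proteins, size))
            if candidate not in labels:  # skip anything already labeled
                found.add(candidate)
        negatives.update({cell: -1 for cell in found})
    return negatives


# Toy data with hypothetical protein IDs.
complexes = [("P1", "P2", "P3", "P4"), ("P1", "P2", "P3"), ("P5", "P6")]
edge_confidence = {frozenset(("P1", "P2")): 0.4}
labels = label_complexes(complexes, edge_confidence)
negatives = sample_negatives(labels, [f"P{i}" for i in range(1, 9)])
```

In this reading, a smaller CORUM complex that is a face of a larger one ends up positive, because its explicit +1 replaces the provisional -1 it received as a sub-face.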
Supported Prediction Tasks (configured, training pipeline in B.2):
- Edge score regression: Predict confidence of protein-protein interactions (0-1 continuous)
- Edge interaction type classification: Multi-label prediction of 7 interaction types per edge
- Higher-order complex prediction: Binary classification of whether a protein set forms a real complex (cells of rank 2 and above)
Additional context
Data Sources:
- HIGH-PPI SHS27k: Human protein interaction network with typed edges (Paper)
- CORUM: Comprehensive Resource of Mammalian protein complexes database (experimentally validated)
Configuration Options:
- min_complex_size/max_complex_size: Control which CORUM complexes to include (default: 2-6)
- target_ranks: List of ranks to predict on (supports single or multi-rank prediction)
- neg_ratio: Negative sample ratio for complex classification
- edge_task: Choose between "score" (regression) and "interaction_type" (classification)
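To make the options above concrete, here is a hedged sketch of how they could be grouped and validated. The `PPIConfig` class and its `keeps` method are hypothetical and not part of this PR. The sketch also assumes that complex size counts proteins, and it leaves target_ranks, neg_ratio, and edge_task without defaults because none are stated here.

```python
from dataclasses import dataclass
from typing import Sequence

VALID_EDGE_TASKS = ("score", "interaction_type")


@dataclass
class PPIConfig:
    """Hypothetical container for the dataset options; not the PR's actual class."""

    target_ranks: Sequence[int]
    neg_ratio: float
    edge_task: str
    min_complex_size: int = 2  # default stated in the PR description
    max_complex_size: int = 6  # default stated in the PR description

    def __post_init__(self):
        if self.min_complex_size > self.max_complex_size:
            raise ValueError("min_complex_size must not exceed max_complex_size")
        if self.edge_task not in VALID_EDGE_TASKS:
            raise ValueError(f"edge_task must be one of {VALID_EDGE_TASKS}")
        if self.neg_ratio < 0:
            raise ValueError("neg_ratio must be non-negative")
        if any(rank < 0 for rank in self.target_ranks):
            raise ValueError("target_ranks must be non-negative")

    def keeps(self, complex_size):
        """Return True if a complex with this many proteins passes the size filter."""
        return self.min_complex_size <= complex_size <= self.max_complex_size


# Example: multi-rank prediction with edge score regression.
cfg = PPIConfig(target_ranks=[2, 3], neg_ratio=1.0, edge_task="score")
```

Validating at construction time surfaces a mistyped edge_task or an inverted size range before any data is loaded.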
Dear Participants,
This is a final reminder regarding the upcoming challenge deadline.
📅 Deadline: Tomorrow, 25th November 2025
✅ Critical Requirement: Please ensure your branch is passing all CI/CD tests.
If you have any pending changes, please push them and verify your build status as soon as possible.
Good luck!