Category: A2; Team name: TG; Dataset: Simplicial PPI (HIGH-PPI SHS27k + CORUM)

Open • grapentt opened this issue 2 months ago • 1 comment

Checklist

[x] My pull request has a clear and explanatory title.
[x] My pull request passes the Linting test.
[x] I added appropriate unit tests and I made sure the code passes all unit tests.
[x] My PR follows PEP8 guidelines.
[x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
[x] I linked to issues and PRs that are relevant to this PR.
[ ] Add the "official" HIGH-PPI splits.

Description

This PR adds the HIGH-PPI SHS27k + CORUM dataset to TopoBench. It is a natively higher-order simplicial complex dataset that combines a protein-protein interaction network with experimentally validated protein complexes.

Note: This PR focuses on dataset integration and infrastructure. The training pipeline for higher-order prediction tasks will be added in my B.2 submission.

Dataset Structure:

0-cells: 1,553 human proteins
1-cells: 6,660 protein-protein interactions with typed edges (7 interaction types: reaction, binding, ptmod, activation, inhibition, catalysis, expression) plus confidence scores
Cells of rank ≥ 2: ~470 experimentally validated protein complexes from CORUM
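
The 8-dimensional edge features described under Simplicial Complex Construction (7 interaction types + 1 confidence score) could be laid out as in the sketch below. This is an illustration, not the PR's code: the function name `encode_edge`, the multi-hot encoding, and the column order (taken from the order in which the types are listed in this description) are assumptions.

```python
from typing import Iterable, List

# Order copied from the PR description; the real column order is an assumption.
INTERACTION_TYPES = [
    "reaction", "binding", "ptmod", "activation",
    "inhibition", "catalysis", "expression",
]


def encode_edge(types: Iterable[str], confidence: float) -> List[float]:
    """Return a hypothetical 8-dim vector: 7 multi-hot type flags, then confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    present = set(types)
    unknown = present - set(INTERACTION_TYPES)
    if unknown:
        raise ValueError(f"unknown interaction types: {sorted(unknown)}")
    # An edge may carry several interaction types, hence multi-hot flags.
    flags = [1.0 if t in present else 0.0 for t in INTERACTION_TYPES]
    return flags + [confidence]
```

An edge annotated with both `binding` and `activation` at confidence 0.8 would then set flags 1 and 3 and store 0.8 in the last position.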

Simplicial Complex Construction:

1. Add proteins as 0-cells.
2. Add HIGH-PPI edges as 1-cells with 8-dim feature vectors (7 interaction types + 1 confidence score).
3. Process CORUM complexes (top-down, largest first):
   - Add each complex to the simplicial complex (automatically creates all sub-faces via TopoNetX)
   - Mark the complex as positive (+1)
   - Mark all proper sub-faces as negative (-1) unless already labeled by a smaller CORUM complex
   - Boost confidence to 1.0 for edges within CORUM complexes (both HIGH-PPI and CORUM-only edges)
4. Generate random negative samples proportionally for higher-order cells (rank ≥ 2) to balance the dataset.
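
The top-down labeling of CORUM complexes can be sketched in plain Python as below. It uses `itertools.combinations` as a stand-in for the sub-face generation that TopoNetX performs, so it does not reflect the PR's actual implementation. The names `proper_subfaces` and `label_complexes` are hypothetical, and two choices are assumptions: sub-faces that are themselves CORUM complexes stay positive, and single vertices are not labeled.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Iterator

Simplex = FrozenSet[str]


def proper_subfaces(simplex: Simplex, min_size: int = 2) -> Iterator[Simplex]:
    """Yield each proper sub-face with at least `min_size` vertices (assumed cutoff)."""
    for k in range(min_size, len(simplex)):
        for face in combinations(sorted(simplex), k):
            yield frozenset(face)


def label_complexes(complexes: Iterable[Iterable[str]]) -> Dict[Simplex, int]:
    """Label CORUM complexes +1 and their non-CORUM proper sub-faces -1."""
    corum = {frozenset(c) for c in complexes}
    labels: Dict[Simplex, int] = {}
    # Largest complexes first, mirroring the top-down order in the description.
    for cplx in sorted(corum, key=len, reverse=True):
        labels[cplx] = 1
        for face in proper_subfaces(cplx):
            # A sub-face that is itself a CORUM complex must not become negative.
            if face not in corum:
                labels.setdefault(face, -1)
    return labels
```

For the complexes `{A, B, C}` and `{A, B}`, this labels both as positive and the remaining edges `{A, C}` and `{B, C}` as negative. The confidence boost for edges inside complexes is not covered here.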

Supported Prediction Tasks (configured, training pipeline in B.2):

Edge score regression: Predict confidence of protein-protein interactions (0-1 continuous)
Edge interaction type classification: Multi-label prediction of 7 interaction types per edge
Higher-order complex prediction: Binary classification of whether a protein set forms a real complex (cells of rank ≥ 2)
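
For the higher-order task, the random negatives mentioned in the construction steps might be drawn as in the sketch below. It assumes that "proportionally" means the number of negatives at each rank is `neg_ratio` times the number of positives at that rank, and that negatives are uniformly random protein sets that are not known complexes. The PR states neither, and `sample_negatives` is a hypothetical helper.

```python
import random
from typing import Dict, FrozenSet, List, Sequence, Set


def sample_negatives(
    nodes: Sequence[str],
    positives: Set[FrozenSet[str]],
    neg_ratio: float,
    seed: int = 0,
) -> Dict[int, List[FrozenSet[str]]]:
    """Return random non-complex protein sets, keyed by rank (size - 1)."""
    rng = random.Random(seed)
    # Count positives per size; rank >= 2 means at least 3 vertices.
    counts: Dict[int, int] = {}
    for p in positives:
        if len(p) >= 3:
            counts[len(p)] = counts.get(len(p), 0) + 1

    negatives: Dict[int, List[FrozenSet[str]]] = {}
    for size, n_pos in sorted(counts.items()):
        wanted = round(neg_ratio * n_pos)
        found: Set[FrozenSet[str]] = set()
        attempts = 0
        # Rejection sampling with an attempt cap so the loop always terminates.
        while len(found) < wanted and attempts < 100 * max(wanted, 1):
            attempts += 1
            candidate = frozenset(rng.sample(list(nodes), size))
            if candidate not in positives:
                found.add(candidate)
        negatives[size - 1] = sorted(found, key=sorted)
    return negatives
```

Fixing the seed keeps the sampled negatives reproducible across runs, which matters if dataset splits are meant to be comparable.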

Additional context

Data Sources:

HIGH-PPI SHS27k: Human protein interaction network with typed edges (Paper)
CORUM: The Comprehensive Resource of Mammalian Protein Complexes, a database of experimentally validated complexes

Configuration Options:

min_complex_size / max_complex_size: Control which CORUM complexes to include, by number of proteins (defaults: 2 and 6)
target_ranks: List of ranks to predict on (supports single or multi-rank prediction)
neg_ratio: Negative sample ratio for complex classification
edge_task: Choose between "score" (regression) or "interaction_type" (classification)
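
One way to picture how these options fit together is the sketch below. Only the option names and the 2-6 size defaults come from this description. The `PPIConfig` class, the helpers `keep_complex` and `edge_target`, and the defaults for `target_ranks`, `neg_ratio`, and `edge_task` are assumptions, and the PR does not state how targets are actually derived.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence, Union


@dataclass
class PPIConfig:
    min_complex_size: int = 2  # default from the PR description
    max_complex_size: int = 6  # default from the PR description
    target_ranks: List[int] = field(default_factory=lambda: [2])  # assumed default
    neg_ratio: float = 1.0  # assumed default
    edge_task: str = "score"  # assumed default

    def __post_init__(self) -> None:
        if self.min_complex_size > self.max_complex_size:
            raise ValueError("min_complex_size exceeds max_complex_size")
        if self.edge_task not in ("score", "interaction_type"):
            raise ValueError(f"unsupported edge_task: {self.edge_task!r}")
        if self.neg_ratio < 0:
            raise ValueError("neg_ratio must be non-negative")


def keep_complex(members: Iterable[str], cfg: PPIConfig) -> bool:
    """Keep a CORUM complex only if its protein count is within the size bounds."""
    return cfg.min_complex_size <= len(set(members)) <= cfg.max_complex_size


def edge_target(features: Sequence[float], cfg: PPIConfig) -> Union[float, List[float]]:
    """Pick the edge target, assuming 7 type flags followed by 1 confidence score."""
    if cfg.edge_task == "score":
        return features[7]  # regression target
    return list(features[:7])  # multi-label classification target
```

With the default bounds, a seven-protein complex is filtered out, while pairs and triples are kept.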

Nov 18 '25 14:11 grapentt

Dear Participants,

This is a final reminder regarding the upcoming challenge deadline.

📅 Deadline: Tomorrow, 25th November 2025

✅ Critical Requirement: Please ensure your branch is passing all CI/CD tests.

If you have any pending changes, please push them and verify your build status as soon as possible.

Good luck!

Nov 24 '25 12:11 levtelyatnikov