Category: B2; Team name: NeuroTriangles; Dataset: A123CortexM (Mouse Auditory Cortex)
Checklist
- [x] My pull request has a clear and explanatory title following the challenge format.
- [x] My pull request passes linting.
- [x] I added appropriate unit tests and made sure the code passes all unit tests.
- [x] My PR follows PEP8 guidelines.
- [x] My code is documented using numpy-style docstrings.
- [x] I included a pipeline test showing that a model can train on the new benchmark task.
- [x] This PR introduces at most one dataset loader, as required by the challenge rules.
Description
This PR is a Category B2 (“Pioneering New TDL Benchmark Tasks”) submission to the TAG-DS Topological Deep Learning Challenge 2025.
It integrates the Bowen et al. (2024) mouse auditory cortex calcium-imaging dataset into TopoBench under the name:
A123CortexM — A1/A2/3 mouse auditory cortex correlation graphs,
and builds a small family of topology-aware benchmark tasks:
- Graph-level BF-bin classification (standard graph classification).
- Triangle role classification (2-simplex motif roles combining embedding × weight).
- Triangle common-neighbour prediction (topological embedding depth of triangles).
All of these are driven by a single dataset / loader pair and configured via the specific_task parameter in the dataset YAML.
In addition, the PR introduces a generic triangle utility in topobench.data.utils.triangle_classifier that can be reused by other datasets.
Dataset and graph construction
The underlying data come from:
Bowen et al. (2024), “Fractured columnar small-world functional network organization in volumes of L2/3 of mouse auditory cortex”, PNAS Nexus.
Each recording session provides:
- Neuronal activity traces,
- Pairwise signal correlations (`SigCorrs`) and noise correlations (`NoiseCorrsTrial`),
- Best-frequency (BF) values per neuron and layer annotations.
The dataset class `A123CortexMDataset(InMemoryDataset)` in `topobench/data/datasets/a123.py` performs the following steps:
- **Download & unpack.** Uses `download_file_from_link` to fetch the “Auditory cortex data” archive and extract it under `raw/`.
- **Session / layer extraction.** For each `.mat` file and each layer (1–5), it reads:
  - `SigCorrs` (signal correlation matrix),
  - `NoiseCorrsTrial` (trial-level noise correlations),
  - `BFInfo[layer]["BFval"]` (per-neuron best frequency).
- **BF-bin subgraphs.** Neurons are binned by BF into `n_bins` (default 9). For each (session, layer, BF-bin) with at least `min_neurons` neurons (default from config, 3 for tests):
  - Correlation and noise-correlation matrices are restricted to those neurons,
  - A sample dictionary is built with metadata: `{session_file, session_id, layer, bf_bin, neuron_indices, corr, noise_corr}`.
- **Graph representation (`_sample_to_pyg_data`).** Each sample becomes a `torch_geometric.data.Data` graph with:
  - Nodes: neurons in a single (session, layer, BF-bin).
  - Node features `x ∈ ℝ^{n×3}`:
    - `mean_corr`: mean signal correlation to the other neurons,
    - `std_corr`: standard deviation of the signal correlations,
    - `noise_diag`: diagonal entries of the noise-correlation matrix (per-neuron noise level).
  - Edges: undirected edges between neuron pairs whose signal correlation ≥ `corr_threshold` (configurable; `corr_threshold: 0.2` in the YAML). Edges are constructed from the upper triangle and symmetrised with `to_undirected`.
  - Edge attributes: correlation weights on those edges.
  - Label `y`: integer BF-bin in `[0, num_classes − 1]` (config `num_classes: 9`).
  - Metadata: `session_id`, `layer`.

During `process()`, graphs with no edges are filtered out, and the remaining graphs are collated and stored in `processed/data.pt`.
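For concreteness, here is a minimal sketch of this per-sample graph construction, simplified from the role played by `_sample_to_pyg_data`; the function name `sample_to_graph` and the exact feature computations below are illustrative, not the PR's code:

```python
import numpy as np
import torch
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected


def sample_to_graph(corr, noise_corr, bf_bin, corr_threshold=0.2):
    """Turn one (session, layer, BF-bin) sample into a PyG graph (illustrative sketch)."""
    n = corr.shape[0]
    mask = ~np.eye(n, dtype=bool)
    # Per-neuron features: mean/std of signal correlation to the other neurons,
    # plus the diagonal of the noise-correlation matrix.
    mean_corr = np.array([corr[i, mask[i]].mean() for i in range(n)])
    std_corr = np.array([corr[i, mask[i]].std() for i in range(n)])
    noise_diag = np.diag(noise_corr)
    x = torch.tensor(np.stack([mean_corr, std_corr, noise_diag], axis=1), dtype=torch.float)

    # Edges: upper-triangle pairs whose signal correlation clears the threshold,
    # symmetrised so the resulting graph is undirected.
    src, dst = np.triu_indices(n, k=1)
    keep = corr[src, dst] >= corr_threshold
    edge_index = torch.tensor(np.stack([src[keep], dst[keep]]), dtype=torch.long)
    edge_attr = torch.tensor(corr[src[keep], dst[keep]], dtype=torch.float).unsqueeze(1)
    edge_index, edge_attr = to_undirected(edge_index, edge_attr)

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr,
                y=torch.tensor([bf_bin], dtype=torch.long))
```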
The dataset behaviour is controlled by the YAML config `configs/dataset/graph/a123.yaml`.
For CI we restrict to `num_graphs: 10` to keep runtime reasonable; users can increase this for full experiments.
Generic triangle utilities
To make triangle-based benchmarks reusable across datasets, this PR adds `topobench/data/utils/triangle_classifier.py` with the base class:
class TriangleClassifier:
"""
Generic triangle utility for weighted graphs.
- enumerate_triangles(G): list of (a, b, c) triangles
- classify_and_weight_triangles(triangles, G): attaches edge weights + domain-specific roles
- extract_triangles(edge_index, edge_weights, num_nodes): convenience from PyG graphs
"""
The base methods provide:
- Efficient triangle enumeration on a NetworkX graph,
- Edge weight extraction per triangle,
- A hook for domain-specific role definitions via `_classify_role` and `_role_to_label` (which are intentionally left abstract).
This utility is then specialised for the auditory cortex dataset, but can be reused by other TopoBench datasets to define their own triangle-based tasks.
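For orientation, here is a minimal sketch of how such a utility can enumerate triangles on a weighted graph built from PyG edge data; the helper names `pyg_to_nx` and `enumerate_triangles` below are illustrative and not necessarily the PR's exact implementation:

```python
import networkx as nx


def pyg_to_nx(edge_index, edge_weights, num_nodes):
    """Build a weighted undirected NetworkX graph from PyG edge data (illustrative helper)."""
    G = nx.Graph()
    G.add_nodes_from(range(num_nodes))
    for (u, v), w in zip(edge_index.t().tolist(), edge_weights.view(-1).tolist()):
        G.add_edge(u, v, weight=float(w))
    return G


def enumerate_triangles(G):
    """List every triangle (a, b, c) exactly once, with a < b < c."""
    triangles = []
    for u, v in G.edges():
        a, b = (u, v) if u < v else (v, u)
        # Any common neighbour of an edge's endpoints closes a triangle;
        # requiring c > b ensures each triangle is reported only once.
        for c in set(G.neighbors(a)) & set(G.neighbors(b)):
            if c > b:
                triangles.append((a, b, c))
    return triangles
```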
A123-specific triangle classifier
In `topobench/data/datasets/a123.py` we define:
class TriangleClassifier(BaseTriangleClassifier):
...
This subclass implements the domain-specific logic for auditory cortex correlation graphs: each triangle is assigned a role class based on how many common neighbours it has and on the weight class of its edges.
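A sketch of this specialisation pattern follows; the class name `A123TriangleClassifier`, the method signatures, the bin thresholds, and the role-string format are hypothetical placeholders, while the `_classify_role` / `_role_to_label` hooks come from the base class described above:

```python
class A123TriangleClassifier(BaseTriangleClassifier):  # illustrative name and thresholds
    """Assigns each triangle a role on an (embedding class x weight class) grid."""

    def _classify_role(self, triangle, G):
        a, b, c = triangle
        # Embedding class: how many common neighbours the triangle has in the graph.
        common = set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c)) - {a, b, c}
        embed_cls = min(len(common), 2)  # e.g. isolated / shallow / deeply embedded
        # Weight class: mean signal correlation over the triangle's three edges.
        w = (G[a][b]["weight"] + G[b][c]["weight"] + G[a][c]["weight"]) / 3.0
        weight_cls = 0 if w < 0.3 else (1 if w < 0.5 else 2)  # placeholder thresholds
        return f"embed{embed_cls}_weight{weight_cls}"

    def _role_to_label(self, role):
        # 3 embedding classes x 3 weight classes -> 9 labels, matching the 0-8 label range.
        embed_cls, weight_cls = int(role[5]), int(role[-1])
        return 3 * embed_cls + weight_cls
```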
Tasks
The dataset’s process() method always builds the graph dataset and then inspects:
specific_task = self.parameters.get("specific_task", "classification")
to optionally build triangle-level tasks:
1. Graph-level BF-bin classification (specific_task: classification)
- Task level: graph.
- Label: BF-bin (0–8).
- Config: `task_level: graph`, `num_classes: 9`, `loss_type: cross_entropy`.
This is a standard graph classification benchmark on correlation graphs, suitable as a baseline and as input for higher-order liftings.
2. Triangle role classification (specific_task: triangle_classification)
- Implemented in:
  - `A123CortexMDataset._extract_triangles_from_graphs()`
  - `A123CortexMDataset.create_triangle_classification_task()`
Step 1 – Triangle extraction: `_extract_triangles_from_graphs()`
- Iterates over all graphs in the dataset,
- Builds a NetworkX graph `G` for each (with edge weights from signal correlations),
- Uses `TriangleClassifier.enumerate_triangles(G)` and `classify_and_weight_triangles()` to obtain triangle dicts with:
  - `nodes`: `(a, b, c)`,
  - `edge_weights`: `[w_ab, w_bc, w_ac]`,
  - `role`: role string,
  - `label`: 0–8.
- A list of raw triangle records is collected, each with: `graph_idx`, `tri` (triangle dict), `G`, `num_nodes`.
Step 2 – Building the triangle dataset
`create_triangle_classification_task()` converts these into `torch_geometric.data.Data` objects:
- `x ∈ ℝ^{1×3}`: the three edge weights (purely topological/functional – no node features or BF info),
- `y`: integer role label in `{0, …, 8}`,
- Metadata: `nodes`, `role`, `graph_idx`.
This defines a triangle-level classification benchmark targeting 2-simplex motif roles.
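As an illustration (not the exact PR code), a single triangle record could be packaged into a `Data` object roughly like this:

```python
import torch
from torch_geometric.data import Data


def triangle_record_to_data(tri, graph_idx):
    """Illustrative conversion of one triangle dict into a triangle-level Data sample."""
    # Features: only the three edge weights, shaped [1, 3] so each triangle is one sample.
    x = torch.tensor([tri["edge_weights"]], dtype=torch.float)
    y = torch.tensor([tri["label"]], dtype=torch.long)
    data = Data(x=x, y=y)
    # Metadata kept for analysis; not used as model input.
    data.nodes = tri["nodes"]
    data.role = tri["role"]
    data.graph_idx = graph_idx
    return data
```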
3. Triangle common-neighbour prediction (specific_task: triangle_common_neighbors)
- Implemented in:
  - `A123CortexMDataset.create_triangle_common_neighbors_task()`
Here we focus on a purely structural topological quantity: the number of common neighbours of each triangle.
For each triangle (a, b, c) in the raw list:
- Compute the set of common neighbours:

  ```python
  common = set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c)) - {a, b, c}
  num_common = len(common)
  ```

- Define the label: the exact common-neighbour count, capped at 8:
  - 0–7 → class 0–7,
  - ≥8 → class 8.
- Define the features: node degrees of the triangle vertices in `G`:

  ```python
  deg_a = G.degree(a)
  deg_b = G.degree(b)
  deg_c = G.degree(c)
  x = [deg_a, deg_b, deg_c]
  ```
So each triangle sample has `x ∈ ℝ^{1×3}` (degrees) and `y ∈ {0, …, 8}` (binned common neighbours).
This gives a triangle-level classification task where labels are higher-order topological statistics (coface-like information), and features are structural (degrees), avoiding direct leakage of the label.
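Putting the pieces together, one common-neighbour sample can be built roughly as follows (illustrative helper, assuming a weighted NetworkX graph `G` as above; the function name is hypothetical):

```python
import torch
from torch_geometric.data import Data


def triangle_cn_sample(G, a, b, c, max_class=8):
    """Illustrative construction of one common-neighbour sample for a triangle (a, b, c)."""
    # Label: number of common neighbours of all three vertices, capped at max_class.
    common = set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c)) - {a, b, c}
    y = torch.tensor([min(len(common), max_class)], dtype=torch.long)
    # Features: the three vertex degrees (structural only, so the label is not leaked directly).
    x = torch.tensor([[G.degree(a), G.degree(b), G.degree(c)]], dtype=torch.float)
    return Data(x=x, y=y)
```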
Loader and task selection
The loader `A123DatasetLoader(AbstractLoader)` in `topobench/data/loaders/graph/a123_loader.py` does the following:
- Reads `data_name` and `specific_task` from `parameters`.
- Constructs `A123CortexMDataset(root, name, parameters)`.
- Depending on `specific_task`:
  - `classification` → uses the default graph dataset from `processed/data.pt`.
  - `triangle_classification` → loads the triangle dataset from `processed/data_triangles.pt` (if it exists) and assigns it to `self.dataset.data` / `self.dataset.slices`.
  - `triangle_common_neighbors` → loads the triangle CN (common-neighbours) dataset from `processed/data_triangles_common_neighbors.pt`.
If the triangle files are missing, the loader emits a clear warning suggesting that the dataset first be processed with the appropriate `specific_task`.
This keeps the one-loader-per-PR rule satisfied while making the triangle tasks selectable via configuration.
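A condensed, illustrative sketch of this branching logic is shown below; the class and attribute names are simplified, and the real loader follows TopoBench's `AbstractLoader` interface:

```python
import os.path as osp

import torch


class A123LoaderSketch:
    """Condensed view of how the loader swaps in the triangle-level datasets (illustrative)."""

    TASK_FILES = {
        "triangle_classification": "data_triangles.pt",
        "triangle_common_neighbors": "data_triangles_common_neighbors.pt",
    }

    def load_dataset(self, dataset, specific_task):
        fname = self.TASK_FILES.get(specific_task)
        if fname is None:
            # Default graph-level task: the dataset already holds processed/data.pt.
            return dataset
        path = osp.join(dataset.processed_dir, fname)
        if not osp.exists(path):
            print(f"Warning: {fname} not found; process the dataset with "
                  f"specific_task={specific_task!r} first.")
            return dataset
        # Replace the collated data/slices with the triangle-level dataset.
        dataset.data, dataset.slices = torch.load(path)
        return dataset
```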
Tests and pipeline integration
To satisfy the challenge requirements:
- **Unit tests**
  - `test/data/load/test_a123_dataset.py` checks:
    - Creation and basic properties of A123 graphs,
    - Correct behaviour of the loader,
    - Correct loading of triangle datasets for `triangle_classification` / `triangle_common_neighbors` (e.g. shapes of `x`, valid label ranges, non-empty datasets in test settings); a sketch of this pattern is shown at the end of this section.
- **Pipeline test**
  - `test/pipeline/test_pipeline.py` is extended with a configuration that:
    - Uses the A123 config (with e.g. `specific_task: triangle_classification` and `num_graphs: 10`),
    - Trains an existing TopoBench model for `max_epochs=2`,
    - Logs metrics such as `train/accuracy`, `val/accuracy`, `test/accuracy`, and macro `precision`, `recall`, and `F1`.
  - This demonstrates that the entire training pipeline runs successfully on the new benchmark task. Performance is not tuned; the goal is compatibility and coverage.
- **Coverage**
  - Tests exercise:
    - `A123CortexMDataset.process`,
    - Triangle extraction / classification / CN logic,
    - `A123DatasetLoader.load_dataset` for multiple `specific_task` settings,
    - The generic `topobench.data.utils.triangle_classifier` utility.
  - This helps maintain the ≥93% Codecov target.
  - Note: we weren't able to check the coverage with Codecov due to dependency incompatibilities.
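For example, the triangle-dataset checks follow a pattern like the one below (an illustrative pytest sketch; the `triangle_dataset` fixture name is hypothetical):

```python
def test_triangle_labels_and_features(triangle_dataset):
    """Check the triangle dataset is non-empty, has [1, 3] features, and labels in the 9 classes."""
    assert len(triangle_dataset) > 0
    for sample in triangle_dataset:
        assert tuple(sample.x.shape) == (1, 3)
        assert 0 <= int(sample.y) <= 8
```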
Why this is a useful B2 benchmark
This contribution adds:
- A real, biophysically grounded brain dataset (mouse auditory cortex), and
- Two explicit triangle-level tasks that are naturally suited to topological models.
Key points for TDL:
- Triangles are 2-simplices of the clique complex of the correlation graph.
- Triangle roles combine internal functional strength and higher-order embedding (common neighbours).
- Common-neighbour prediction targets a coface-like topological statistic directly.
These are exactly the kind of questions where simplicial networks, cell-complex networks, and hypergraph networks should shine compared to edge-only GNNs:
- They can operate directly on 2-cells and their cofaces,
- They can capture how information “flows” through triangles embedded in local motifs,
- They can more naturally encode constraints on multi-neuron interactions (beyond pairwise edges).
Because everything is driven through a single YAML (specific_task switch) and a reusable triangle utility, the benchmark is also extensible: other datasets can plug into topobench.data.utils.triangle_classifier and define their own domain-specific triangle roles or CN-style tasks.
Limitations and future directions
- Triangle roles currently use fixed correlation thresholds and simple bins; more refined roles could incorporate spatial distances, laminar structure, or alternative measures (e.g. causal TE networks).
- The common-neighbour task is currently treated as a 9-class classification problem; a regression version or ordinal metrics would be natural extensions.
- The CI config only uses `num_graphs: 10`; full experiments on the whole dataset will likely reveal richer distributions of motif types and CN counts.
Relation to previous work
This PR builds on my earlier contributions to data loading and streaming in TopoBench (previous PR: #241), but is focused on the new TDL benchmark tasks (Category B2), with an emphasis on higher-order structure in functional brain networks.
Note: This PR duplicates the changes from #241 because, without the updated `download_file_from_link` function, the dataset cannot be downloaded and the submission would not run.
References
- Bowen, Z., et al. (2024). Fractured columnar small-world functional network organization in volumes of L2/3 of mouse auditory cortex. PNAS Nexus, 3(2), pgae074.
Hi @marindigen! It seems that the testing error can be easily fixed just by slightly renaming the newly introduced test_io_utils.py file. Good luck!
The CI failed with a "no space left on device" error on the GitHub runner (an infrastructure issue, not code-related), which could be due to the size of the dataset (~2 GB). What could be a possible solution to this?
Dear @marindigen, one possible solution is to mock the data instead of downloading it. Please refer to PR #233 for reference if needed.
Hi again @marindigen! Could you please comment out (or convert to markdown) the content of `tutorial_train_brain_model.ipynb`? We decided to accept your submission given the cause of the failing test.
Thank you!
Hi @gbg141 and @levtelyatnikov!
Thank you for your patience and for accepting the submission as it is! I have converted the tutorial to a markdown file.