Category: B2; Team name: NeuroTriangles; Dataset: A123CortexM (Mouse Auditory Cortex)
Checklist
- [x] My pull request has a clear and explanatory title following the challenge format.
- [x] My pull request passes linting.
- [x] I added appropriate unit tests and made sure the code passes all unit tests.
- [x] My PR follows PEP8 guidelines.
- [x] My code is documented using numpy-style docstrings.
- [x] I included a pipeline test showing that a model can train on the new benchmark task.
- [x] This PR introduces at most one dataset loader, as required by the challenge rules.
Description
This PR is a Category B2 (“Pioneering New TDL Benchmark Tasks”) submission to the TAG-DS Topological Deep Learning Challenge 2025.
It integrates the Bowen et al. (2024) mouse auditory cortex calcium-imaging dataset into TopoBench under the name:
A123CortexM — A1/A2/3 mouse auditory cortex correlation graphs,
and builds a small family of topology-aware benchmark tasks:
- Graph-level BF-bin classification (standard graph classification).
- Triangle role classification (2-simplex motif roles combining embedding × weight).
- Triangle common-neighbour prediction (topological embedding depth of triangles).
All of these are driven by a single dataset / loader pair and configured via the specific_task parameter in the dataset YAML.
In addition, the PR introduces a generic triangle utility in topobench.data.utils.triangle_classifier that can be reused by other datasets.
Dataset and graph construction
The underlying data come from:
Bowen et al. (2024), “Fractured columnar small-world functional network organization in volumes of L2/3 of mouse auditory cortex”, PNAS Nexus.
Each recording session provides:
- Neuronal activity traces,
- Pairwise signal correlations (`SigCorrs`) and noise correlations (`NoiseCorrsTrial`),
- Best-frequency (BF) values per neuron and layer annotations.
The dataset class `A123CortexMDataset(InMemoryDataset)` in `topobench/data/datasets/a123.py` performs the following steps:
- **Download & unpack.** Uses `download_file_from_link` to fetch the “Auditory cortex data” archive and extract it under `raw/`.
- **Session / layer extraction.** For each `.mat` file and each layer (1–5), it reads:
  - `SigCorrs` (signal correlation matrix),
  - `NoiseCorrsTrial` (trial-level noise correlations),
  - `BFInfo[layer]["BFval"]` (per-neuron best frequency).
- **BF-bin subgraphs.** Neurons are binned by BF into `n_bins` (default 9). For each (session, layer, BF-bin) with at least `min_neurons` neurons (default from config, 3 for tests):
  - Correlation and noise-correlation matrices are restricted to those neurons,
  - A sample dictionary is built with metadata: `{session_file, session_id, layer, bf_bin, neuron_indices, corr, noise_corr}`.
- **Graph representation (`_sample_to_pyg_data`).** Each sample becomes a `torch_geometric.data.Data` graph with:
  - Nodes: neurons in a single (session, layer, BF-bin).
  - Node features `x ∈ ℝ^{n×3}`:
    - `mean_corr`: mean signal correlation to the other neurons,
    - `std_corr`: standard deviation of the signal correlations,
    - `noise_diag`: diagonal entries of the noise-correlation matrix (per-neuron noise level).
  - Edges: undirected edges between neuron pairs whose signal correlation ≥ `corr_threshold` (configurable; `corr_threshold: 0.2` in the YAML). Edges are constructed from the upper triangle and symmetrised with `to_undirected`.
  - Edge attributes: correlation weights on those edges.
  - Label `y`: integer BF-bin in `[0, num_classes − 1]` (config `num_classes: 9`).
  - Metadata: `session_id`, `layer`.

During `process()`, graphs with no edges are filtered out, and the remaining graphs are collated and stored in `processed/data.pt`.
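For concreteness, here is a minimal sketch of this per-sample graph construction, simplified from the role played by `_sample_to_pyg_data`; the function name `sample_to_graph` and the exact feature computations below are illustrative, not the PR's code:

```python
import numpy as np
import torch
from torch_geometric.data import Data
from torch_geometric.utils import to_undirected


def sample_to_graph(corr, noise_corr, bf_bin, corr_threshold=0.2):
    """Turn one (session, layer, BF-bin) sample into a PyG graph (illustrative sketch)."""
    n = corr.shape[0]
    mask = ~np.eye(n, dtype=bool)
    # Per-neuron features: mean/std of signal correlation to the other neurons,
    # plus the diagonal of the noise-correlation matrix.
    mean_corr = np.array([corr[i, mask[i]].mean() for i in range(n)])
    std_corr = np.array([corr[i, mask[i]].std() for i in range(n)])
    noise_diag = np.diag(noise_corr)
    x = torch.tensor(np.stack([mean_corr, std_corr, noise_diag], axis=1), dtype=torch.float)

    # Edges: upper-triangle pairs whose signal correlation clears the threshold,
    # symmetrised so the resulting graph is undirected.
    src, dst = np.triu_indices(n, k=1)
    keep = corr[src, dst] >= corr_threshold
    edge_index = torch.tensor(np.stack([src[keep], dst[keep]]), dtype=torch.long)
    edge_attr = torch.tensor(corr[src[keep], dst[keep]], dtype=torch.float).unsqueeze(1)
    edge_index, edge_attr = to_undirected(edge_index, edge_attr)

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr,
                y=torch.tensor([bf_bin], dtype=torch.long))
```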
The dataset behaviour is controlled by the YAML config `configs/dataset/graph/a123.yaml`.
For CI we restrict to `num_graphs: 10` to keep runtime reasonable; users can increase this for full experiments.
Generic triangle utilities
To make triangle-based benchmarks reusable across datasets, this PR adds `topobench/data/utils/triangle_classifier.py` with the base class:
class TriangleClassifier:
"""
Generic triangle utility for weighted graphs.
- enumerate_triangles(G): list of (a, b, c) triangles
- classify_and_weight_triangles(triangles, G): attaches edge weights + domain-specific roles
- extract_triangles(edge_index, edge_weights, num_nodes): convenience from PyG graphs
"""
The base methods provide:
- Efficient triangle enumeration on a NetworkX graph,
- Edge weight extraction per triangle,
- A hook for domain-specific role definitions via `_classify_role` and `_role_to_label` (which are intentionally left abstract).
This utility is then specialised for the auditory cortex dataset, but can be reused by other TopoBench datasets to define their own triangle-based tasks.
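For orientation, here is a minimal sketch of how such a utility can enumerate triangles on a weighted graph built from PyG edge data; the helper names `pyg_to_nx` and `enumerate_triangles` below are illustrative and not necessarily the PR's exact implementation:

```python
import networkx as nx


def pyg_to_nx(edge_index, edge_weights, num_nodes):
    """Build a weighted undirected NetworkX graph from PyG edge data (illustrative helper)."""
    G = nx.Graph()
    G.add_nodes_from(range(num_nodes))
    for (u, v), w in zip(edge_index.t().tolist(), edge_weights.view(-1).tolist()):
        G.add_edge(u, v, weight=float(w))
    return G


def enumerate_triangles(G):
    """List every triangle (a, b, c) exactly once, with a < b < c."""
    triangles = []
    for u, v in G.edges():
        a, b = (u, v) if u < v else (v, u)
        # Any common neighbour of an edge's endpoints closes a triangle;
        # requiring c > b ensures each triangle is reported only once.
        for c in set(G.neighbors(a)) & set(G.neighbors(b)):
            if c > b:
                triangles.append((a, b, c))
    return triangles
```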
A123-specific triangle classifier
In `topobench/data/datasets/a123.py` we define:
class TriangleClassifier(BaseTriangleClassifier):
...
This subclass implements the domain-specific logic for auditory cortex correlation graphs: each triangle is assigned a role class based on how many common neighbours it has and on the weight class of its edges.
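A sketch of this specialisation pattern follows; the class name `A123TriangleClassifier`, the method signatures, the bin thresholds, and the role-string format are hypothetical placeholders, while the `_classify_role` / `_role_to_label` hooks come from the base class described above:

```python
class A123TriangleClassifier(BaseTriangleClassifier):  # illustrative name and thresholds
    """Assigns each triangle a role on an (embedding class x weight class) grid."""

    def _classify_role(self, triangle, G):
        a, b, c = triangle
        # Embedding class: how many common neighbours the triangle has in the graph.
        common = set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c)) - {a, b, c}
        embed_cls = min(len(common), 2)  # e.g. isolated / shallow / deeply embedded
        # Weight class: mean signal correlation over the triangle's three edges.
        w = (G[a][b]["weight"] + G[b][c]["weight"] + G[a][c]["weight"]) / 3.0
        weight_cls = 0 if w < 0.3 else (1 if w < 0.5 else 2)  # placeholder thresholds
        return f"embed{embed_cls}_weight{weight_cls}"

    def _role_to_label(self, role):
        # 3 embedding classes x 3 weight classes -> 9 labels, matching the 0-8 label range.
        embed_cls, weight_cls = int(role[5]), int(role[-1])
        return 3 * embed_cls + weight_cls
```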
Tasks
The dataset’s process() method always builds the graph dataset and then inspects:
specific_task = self.parameters.get("specific_task", "classification")
to optionally build triangle-level tasks:
1. Graph-level BF-bin classification (specific_task: classification)
- Task level: graph.
- Label: BF-bin (0–8).
- Config: `task_level: graph`, `num_classes: 9`, `loss_type: cross_entropy`.
This is a standard graph classification benchmark on correlation graphs, suitable as a baseline and as input for higher-order liftings.
2. Triangle role classification (specific_task: triangle_classification)
- Implemented in:
  - `A123CortexMDataset._extract_triangles_from_graphs()`
  - `A123CortexMDataset.create_triangle_classification_task()`
Step 1 – Triangle extraction: `_extract_triangles_from_graphs()`
- Iterates over all graphs in the dataset,
- Builds a NetworkX graph `G` for each (with edge weights from signal correlations),
- Uses `TriangleClassifier.enumerate_triangles(G)` and `classify_and_weight_triangles()` to obtain triangle dicts with:
  - `nodes`: `(a, b, c)`,
  - `edge_weights`: `[w_ab, w_bc, w_ac]`,
  - `role`: role string,
  - `label`: 0–8.
- A list of raw triangle records is collected, each with: `graph_idx`, `tri` (triangle dict), `G`, `num_nodes`.
Step 2 – Building the triangle dataset
`create_triangle_classification_task()` converts these into `torch_geometric.data.Data` objects:
- `x ∈ ℝ^{1×3}`: the three edge weights (purely topological/functional – no node features or BF info),
- `y`: integer role label in `{0, …, 8}`,
- Metadata: `nodes`, `role`, `graph_idx`.
This defines a triangle-level classification benchmark targeting 2-simplex motif roles.
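As an illustration (not the exact PR code), a single triangle record could be packaged into a `Data` object roughly like this:

```python
import torch
from torch_geometric.data import Data


def triangle_record_to_data(tri, graph_idx):
    """Illustrative conversion of one triangle dict into a triangle-level Data sample."""
    # Features: only the three edge weights, shaped [1, 3] so each triangle is one sample.
    x = torch.tensor([tri["edge_weights"]], dtype=torch.float)
    y = torch.tensor([tri["label"]], dtype=torch.long)
    data = Data(x=x, y=y)
    # Metadata kept for analysis; not used as model input.
    data.nodes = tri["nodes"]
    data.role = tri["role"]
    data.graph_idx = graph_idx
    return data
```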
3. Triangle common-neighbour prediction (specific_task: triangle_common_neighbors)
- Implemented in:
  - `A123CortexMDataset.create_triangle_common_neighbors_task()`
Here we focus on a purely structural topological quantity: the number of common neighbours of each triangle.
For each triangle (a, b, c) in the raw list:
- Compute the set of common neighbours:

  ```python
  common = set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c)) - {a, b, c}
  num_common = len(common)
  ```

- Define the label: the exact common-neighbour count, capped at 8:
  - 0–7 → class 0–7,
  - ≥8 → class 8.
- Define the features: node degrees of the triangle vertices in `G`:

  ```python
  deg_a = G.degree(a)
  deg_b = G.degree(b)
  deg_c = G.degree(c)
  x = [deg_a, deg_b, deg_c]
  ```
So each triangle sample has `x ∈ ℝ^{1×3}` (degrees) and `y ∈ {0, …, 8}` (binned common neighbours).
This gives a triangle-level classification task where labels are higher-order topological statistics (coface-like information), and features are structural (degrees), avoiding direct leakage of the label.
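Putting the pieces together, one common-neighbour sample can be built roughly as follows (illustrative helper, assuming a weighted NetworkX graph `G` as above; the function name is hypothetical):

```python
import torch
from torch_geometric.data import Data


def triangle_cn_sample(G, a, b, c, max_class=8):
    """Illustrative construction of one common-neighbour sample for a triangle (a, b, c)."""
    # Label: number of common neighbours of all three vertices, capped at max_class.
    common = set(G.neighbors(a)) & set(G.neighbors(b)) & set(G.neighbors(c)) - {a, b, c}
    y = torch.tensor([min(len(common), max_class)], dtype=torch.long)
    # Features: the three vertex degrees (structural only, so the label is not leaked directly).
    x = torch.tensor([[G.degree(a), G.degree(b), G.degree(c)]], dtype=torch.float)
    return Data(x=x, y=y)
```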
Loader and task selection
The loader `A123DatasetLoader(AbstractLoader)` in `topobench/data/loaders/graph/a123_loader.py` does the following:
- Reads `data_name` and `specific_task` from `parameters`.
- Constructs `A123CortexMDataset(root, name, parameters)`.
- Depending on `specific_task`:
  - `classification` → uses the default graph dataset from `processed/data.pt`.
  - `triangle_classification` → loads the triangle dataset from `processed/data_triangles.pt` (if it exists) and assigns it to `self.dataset.data` / `self.dataset.slices`.
  - `triangle_common_neighbors` → loads the triangle CN (common-neighbours) dataset from `processed/data_triangles_common_neighbors.pt`.
If the triangle files are missing, the loader emits a clear warning suggesting that the dataset first be processed with the appropriate `specific_task`.
This keeps the one-loader-per-PR rule satisfied while making the triangle tasks selectable via configuration.
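A condensed, illustrative sketch of this branching logic is shown below; the class and attribute names are simplified, and the real loader follows TopoBench's `AbstractLoader` interface:

```python
import os.path as osp

import torch


class A123LoaderSketch:
    """Condensed view of how the loader swaps in the triangle-level datasets (illustrative)."""

    TASK_FILES = {
        "triangle_classification": "data_triangles.pt",
        "triangle_common_neighbors": "data_triangles_common_neighbors.pt",
    }

    def load_dataset(self, dataset, specific_task):
        fname = self.TASK_FILES.get(specific_task)
        if fname is None:
            # Default graph-level task: the dataset already holds processed/data.pt.
            return dataset
        path = osp.join(dataset.processed_dir, fname)
        if not osp.exists(path):
            print(f"Warning: {fname} not found; process the dataset with "
                  f"specific_task={specific_task!r} first.")
            return dataset
        # Replace the collated data/slices with the triangle-level dataset.
        dataset.data, dataset.slices = torch.load(path)
        return dataset
```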
Tests and pipeline integration
To satisfy the challenge requirements:
- **Unit tests**
  - `test/data/load/test_a123_dataset.py` checks:
    - Creation and basic properties of A123 graphs,
    - Correct behaviour of the loader,
    - Correct loading of triangle datasets for `triangle_classification` / `triangle_common_neighbors` (e.g. shapes of `x`, valid label ranges, non-empty datasets in test settings); a sketch of this pattern is shown at the end of this section.
- **Pipeline test**
  - `test/pipeline/test_pipeline.py` is extended with a configuration that:
    - Uses the A123 config (with e.g. `specific_task: triangle_classification` and `num_graphs: 10`),
    - Trains an existing TopoBench model for `max_epochs=2`,
    - Logs metrics such as `train/accuracy`, `val/accuracy`, `test/accuracy`, and macro `precision`, `recall`, and `F1`.
  - This demonstrates that the entire training pipeline runs successfully on the new benchmark task. Performance is not tuned; the goal is compatibility and coverage.
- **Coverage**
  - Tests exercise:
    - `A123CortexMDataset.process`,
    - Triangle extraction / classification / CN logic,
    - `A123DatasetLoader.load_dataset` for multiple `specific_task` settings,
    - The generic `topobench.data.utils.triangle_classifier` utility.
  - This helps maintain the ≥93% Codecov target.
  - Note: we weren't able to check the coverage with Codecov due to dependency incompatibilities.
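For example, the triangle-dataset checks follow a pattern like the one below (an illustrative pytest sketch; the `triangle_dataset` fixture name is hypothetical):

```python
def test_triangle_labels_and_features(triangle_dataset):
    """Check the triangle dataset is non-empty, has [1, 3] features, and labels in the 9 classes."""
    assert len(triangle_dataset) > 0
    for sample in triangle_dataset:
        assert tuple(sample.x.shape) == (1, 3)
        assert 0 <= int(sample.y) <= 8
```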
Why this is a useful B2 benchmark
This contribution adds:
- A real, biophysically grounded brain dataset (mouse auditory cortex), and
- Two explicit triangle-level tasks that are naturally suited to topological models.
Key points for TDL:
- Triangles are 2-simplices of the clique complex of the correlation graph.
- Triangle roles combine internal functional strength and higher-order embedding (common neighbours).
- Common-neighbour prediction targets a coface-like topological statistic directly.
These are exactly the kind of questions where simplicial networks, cell-complex networks, and hypergraph networks should shine compared to edge-only GNNs:
- They can operate directly on 2-cells and their cofaces,
- They can capture how information “flows” through triangles embedded in local motifs,
- They can more naturally encode constraints on multi-neuron interactions (beyond pairwise edges).
Because everything is driven through a single YAML (specific_task switch) and a reusable triangle utility, the benchmark is also extensible: other datasets can plug into topobench.data.utils.triangle_classifier and define their own domain-specific triangle roles or CN-style tasks.
Limitations and future directions
- Triangle roles currently use fixed correlation thresholds and simple bins; more refined roles could incorporate spatial distances, laminar structure, or alternative measures (e.g. causal TE networks).
- The common-neighbour task is currently treated as a 9-class classification problem; a regression version or ordinal metrics would be natural extensions.
- The CI config only uses `num_graphs: 10`; full experiments on the whole dataset will likely reveal richer distributions of motif types and CN counts.
Relation to previous work
This PR builds on my earlier contributions to data loading and streaming in TopoBench (previous PR: #241), but is focused on the new TDL benchmark tasks (Category B2), with an emphasis on higher-order structure in functional brain networks.
Note: This PR duplicates the changes from #241 because, without the updated `download_file_from_link` function, the dataset cannot be downloaded and the submission would not run.
References
- Bowen, Z., et al. (2024). Fractured columnar small-world functional network organization in volumes of L2/3 of mouse auditory cortex. PNAS Nexus, 3(2), pgae074.
Hi @marindigen! It seems that the testing error can be easily fixed just by slightly renaming the newly introduced test_io_utils.py file. Good luck!
The CI failed with a "no space left on device" error on the GitHub runner (an infrastructure issue, not code-related), which could be due to the size of the dataset (~2 GB). What could be a possible solution to this?
Dear @marindigen, one possible solution is to mock the data instead of downloading it. Please refer to PR #233 for reference if needed.
Hi again @marindigen! Could you please comment out (or convert to markdown) the content of `tutorial_train_brain_model.ipynb`? We decided to accept your submission given the cause of the failing test.
Thank you!
Hi @gbg141 and @levtelyatnikov!
Thank you for your patience and for accepting the submission as it is! I have converted the tutorial to a markdown file.