
Category: B1 (Bonus); Team name: TG; Dataset: ogbn-products

grapentt opened this issue 5 months ago · 1 comment

🚀 B1 Bonus: On-Disk Transductive Learning with SQLite-Backed Structure Indexing

📌 Problem Statement

Challenge: Training Topological Neural Networks (TNNs) on large transductive graphs (100K+ nodes) with complex topological structures (cliques, cycles, etc.) faces fundamental memory limitations:

  1. Structure Explosion: A graph with 100K nodes can contain millions of cliques/simplices
  2. Preprocessing Bottleneck: Computing and storing all topological structures requires O(N × D^k) memory where:
    • N = number of nodes
    • D = average degree
    • k = structure size (e.g., 3 for triangles)
  3. Transductive Constraint: Unlike inductive learning (separate train/test graphs), transductive learning requires sampling from a single large graph with train/val/test masks

Real-World Impact: Popular benchmarks like ogbn-products (2.4M nodes, 61M edges) are currently infeasible for TNNs due to these memory constraints.
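To make the memory bound concrete, here is an illustrative back-of-envelope calculation for ogbn-products using the node and edge counts above (the average degree and bytes-per-structure figures are assumptions for illustration, not measurements):

```python
# Illustrative worst-case estimate of the O(N * D^k) structure bound
# for ogbn-products (N and E from the problem statement; D approximated
# as the average degree, storage cost assumed at 8 bytes per node id).
N = 2_400_000          # nodes
E = 61_000_000         # edges
D = 2 * E / N          # average degree (~51)
k = 3                  # structure size (triangles)

upper_bound = N * D ** k           # worst-case structure count
bytes_per_structure = k * 8        # k node ids at 8 bytes each (assumption)
print(f"~{upper_bound:.2e} structures, "
      f"~{upper_bound * bytes_per_structure / 1e12:.1f} TB worst case")
```

Even as a loose upper bound, this is orders of magnitude beyond what fits in RAM, which motivates pushing the structure index to disk.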


💡 Our Approach: Two-Strategy Solution

We developed two complementary strategies for memory-efficient transductive TNN training:

Strategy 1: Structure-Centric Sampling 🎯

Guarantee: 100% Structure Completeness

# Sample structures FIRST, then extract nodes
structures = sample_k_cliques(target=500)  # Sample 500 cliques
nodes = union_of_nodes(structures)          # Extract their nodes
batch = build_subgraph(nodes, structures)   # Create batch

Key Innovation: Reverses traditional graph sampling: we sample topological structures (cliques) first, then derive the node set. This guarantees that every sampled structure is 100% complete in the batch (no missing nodes).
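The structures-first idea can be sketched end to end on a toy random graph. This is a self-contained, stdlib-only illustration: in the real pipeline the triangles would come from the SQLite index rather than brute-force enumeration, and the helper names are mine, not the library's.

```python
# Hypothetical sketch of structure-centric sampling on a toy random graph:
# sample structures (triangles) FIRST, then derive the node set, so every
# sampled structure is fully contained in the batch.
import itertools
import random

rng = random.Random(0)
n = 60
edges = {frozenset(e) for e in itertools.combinations(range(n), 2)
         if rng.random() < 0.2}

def is_triangle(a, b, c):
    return all(frozenset(p) in edges for p in ((a, b), (b, c), (a, c)))

# 1. Enumerate 3-cliques (the real pipeline reads these from the SQLite index).
triangles = [t for t in itertools.combinations(range(n), 3) if is_triangle(*t)]

# 2. Sample structures FIRST (structure budget of 20).
sampled = rng.sample(triangles, min(20, len(triangles)))

# 3. Derive the node set and induce the subgraph batch.
nodes = set().union(*map(set, sampled))
batch_edges = {e for e in edges if e <= nodes}

# Every sampled triangle is 100% complete in the batch by construction.
complete = all(frozenset(p) in batch_edges
               for t in sampled for p in itertools.combinations(t, 2))
print(complete)  # → True
```

The completeness guarantee falls out of the ordering: the node set is defined as the union of the sampled structures, so no structure can be missing a node.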

Strategy 2: Extended Context Sampling 🌐

Near-Complete Structures with Topology-Aware Heuristics

# Sample nodes FIRST (cluster-aware), then expand context
core_nodes = cluster_aware_sample(batch_size=1000)     # Louvain clustering
context_nodes = expand_context(core_nodes, ratio=1.5)  # +50% context
all_nodes = core_nodes | context_nodes                 # Core + context
structures = query_structures(all_nodes)               # Query structures
batch = build_subgraph(all_nodes, structures, core_mask)

Key Innovation: Uses graph topology (community detection) to sample dense regions, then adds context nodes to increase structure completeness. Distinguishes "core" vs "context" nodes for loss computation.
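A minimal stdlib-only sketch of this strategy is below. BFS from a high-degree seed stands in for Louvain community detection (which the real sampler uses); the point is the shape of the pipeline: pick dense "core" nodes, add neighboring "context" nodes to raise structure completeness, and keep a core mask so the loss is computed on core nodes only.

```python
# Hypothetical sketch of extended-context sampling. BFS from a dense seed
# is a stand-in for cluster-aware (Louvain) sampling.
import collections
import itertools
import random

rng = random.Random(1)
n = 60
adj = collections.defaultdict(set)
for a, b in itertools.combinations(range(n), 2):
    if rng.random() < 0.1:
        adj[a].add(b)
        adj[b].add(a)

# 1. Core nodes: BFS region around the highest-degree seed.
seed = max(range(n), key=lambda v: len(adj[v]))
core, frontier = set(), collections.deque([seed])
while frontier and len(core) < 15:
    u = frontier.popleft()
    if u not in core:
        core.add(u)
        frontier.extend(adj[u])

# 2. Context nodes: extra neighbors of the core (here up to +50%, ratio=1.5).
candidates = sorted(set().union(*(adj[u] for u in core)) - core)
context = set(rng.sample(candidates, min(len(candidates), len(core) // 2)))

# 3. Core mask distinguishes loss nodes from context-only nodes.
all_nodes = core | context
core_mask = {v: v in core for v in sorted(all_nodes)}
print(len(core), len(context))
```

Unlike Strategy 1, structures touching the boundary of the sampled region may still be incomplete; the context expansion only makes that less likely.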


πŸ—οΈ Architecture & Workflow

Core Components

┌─────────────────────────────────────────────────────────────┐
│  1. OnDiskTransductivePreprocessor                          │
│     - Converts PyG Data → NetworkX graph                    │
│     - Detects topological structures (cliques up to size k) │
│     - Builds SQLite index via CliqueQueryEngine             │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  2. SQLite Index (Persistent Disk Storage)                  │
│     - Table: cliques (id, size, node_list_json)             │
│     - Table: node_to_cliques (node_id, clique_id)           │
│     - Enables fast queries: O(log N) per structure          │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  3. Strategy-Specific Samplers                              │
│     A. StructureCentricSampler                              │
│        - Samples structure IDs directly                     │
│        - Budget-aware: stops at node_budget                 │
│     B. ClusterAwareNodeSampler                              │
│        - Uses Louvain/KMeans clustering                     │
│        - Samples from dense communities                     │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  4. Collate Functions (On-Demand Structure Query)           │
│     - Query structures from SQLite index                    │
│     - Build PyG Data batch with subgraph                    │
│     - Apply topological transforms (liftings) at batch time │
│     - Return: Data(x, edge_index, precomputed_structures)   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  5. TransductiveSplitDataset (Seamless Integration)         │
│     - Wraps preprocessor + sampler + collate                │
│     - Compatible with TBDataloader (batch_size=1)           │
│     - Lazy materialization for memory efficiency            │
└─────────────────────────────────────────────────────────────┘
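The SQLite index at the heart of the pipeline can be sketched with the standard-library `sqlite3` module. The table and column names below follow the diagram; the real CliqueQueryEngine may organize its schema differently.

```python
# Minimal sketch of the SQLite structure index described above.
import json
import sqlite3

con = sqlite3.connect(":memory:")  # on disk this would be ./index/structures.db
con.executescript("""
    CREATE TABLE cliques (
        id INTEGER PRIMARY KEY,
        size INTEGER NOT NULL,
        node_list_json TEXT NOT NULL
    );
    CREATE TABLE node_to_cliques (
        node_id INTEGER NOT NULL,
        clique_id INTEGER NOT NULL REFERENCES cliques(id)
    );
    -- B-tree index enabling O(log N) node -> structure lookups.
    CREATE INDEX idx_node ON node_to_cliques(node_id);
""")

# Insert one triangle and its bidirectional node mapping.
con.execute("INSERT INTO cliques VALUES (1, 3, ?)", (json.dumps([4, 7, 9]),))
con.executemany("INSERT INTO node_to_cliques VALUES (?, 1)", [(4,), (7,), (9,)])

# Query: all cliques containing node 7 (used by extended-context collation).
rows = con.execute(
    "SELECT c.id, c.node_list_json FROM cliques c "
    "JOIN node_to_cliques m ON m.clique_id = c.id WHERE m.node_id = ?", (7,)
).fetchall()
print(rows)  # → [(1, '[4, 7, 9]')]
```

Because the index lives on disk and is queried per batch, the full structure set never has to be materialized in memory.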

Training Workflow

# 1. Preprocessing (One-Time, Cached on Disk)
preprocessor = OnDiskTransductivePreprocessor(
    graph_data=data,           # PyG Data object
    data_dir="./index",        # SQLite storage
    max_clique_size=3          # Detect up to 3-cliques
)
preprocessor.build_index()     # Stores to disk: ./index/structures.db

# 2. Create Splits (Automatic Batching)
split_config = OmegaConf.create({
    "strategy": "structure_centric",  # or "extended_context"
    "structures_per_batch": 500,      # Structure budget
    "node_budget": 2000,               # Node budget (prevents explosion)
    "shuffle": True
})
train, val, test = preprocessor.load_dataset_splits(split_config)

# 3. Training (Standard PyTorch Lightning)
datamodule = TBDataloader(train, val, test, batch_size=1)  # batch_size=1!
trainer.fit(model, datamodule)  # Batches pre-built, just iterate

# Memory: O(batch_size × D^k) instead of O(N × D^k) ✅

✨ Key Innovations

1. SQLite-Backed Structure Index

  • Persistent storage: Index computed once, reused across runs
  • Fast queries: B-tree indexing enables O(log N) structure lookups
  • Scalable: Handles millions of structures without memory overhead
  • Node-to-structure mapping: Bidirectional queries for both strategies

2. Dual Sampling Strategies

  • Structure-Centric: Novel "structures-first" paradigm guarantees completeness
  • Extended Context: Topology-aware node sampling with context expansion
  • Trade-off selection: Users choose based on their priority (completeness vs. coverage)

3. Batch-Time Transform Application

  • Deferred lifting: Topological transforms (SimplicialCliqueLifting, etc.) applied during batch collation
  • Memory efficiency: Only lift mini-batch subgraphs, not entire graph
  • Flexibility: Each batch gets fresh transforms (supports data augmentation)
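Deferred lifting can be illustrated with a toy collate function. The `lift` helper here is a stand-in for a real transform such as SimplicialCliqueLifting; the key point is that it runs inside collation on the mini-batch subgraph only, never on the full graph.

```python
# Sketch of batch-time ("deferred") lifting: the topological transform runs
# in the collate function on the induced subgraph of the sampled nodes.
import itertools

def lift(nodes, edges):
    """Toy clique lifting: return the triangles of the subgraph."""
    eset = set(map(frozenset, edges))
    return [t for t in itertools.combinations(sorted(nodes), 3)
            if all(frozenset(p) in eset for p in itertools.combinations(t, 2))]

def collate(nodes, all_edges):
    # 1. Induce the subgraph for this batch.
    sub_edges = [e for e in all_edges if set(e) <= set(nodes)]
    # 2. Apply the lifting to the SUBGRAPH only: memory O(batch_size * D^k).
    simplices = lift(nodes, sub_edges)
    return {"nodes": nodes, "edge_index": sub_edges, "structures": simplices}

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
batch = collate([0, 1, 2, 3], edges)
print(batch["structures"])  # → [(0, 1, 2)]
```

Note that edge (3, 4) is dropped because node 4 is outside the batch, and only the one triangle fully inside the batch is lifted.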

4. Seamless Integration with Existing Pipeline

# Works exactly like inductive learning!
train, val, test = preprocessor.load_dataset_splits(config)
datamodule = TBDataloader(train, val, test, batch_size=1)
trainer.fit(model, datamodule)
  • Zero API changes for users
  • Compatible with all existing models, transforms, and training loops
  • Drop-in replacement for standard preprocessing

5. Budget-Aware Sampling

  • Node budget: Prevents batch size explosion (critical for structure-centric)
  • Structure budget: Controls number of structures per batch
  • Adaptive: The sampler stops once a budget is reached, ensuring consistent memory usage per batch
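The interaction of the two budgets can be sketched as a simple greedy loop (function and variable names are illustrative, not the library's API): stop once the structure budget is hit, and skip any structure whose nodes would push the batch past the node budget.

```python
# Sketch of budget-aware sampling with a structure budget and a node budget.
def sample_with_budget(structures, structure_budget, node_budget):
    chosen, nodes = [], set()
    for s in structures:
        if len(chosen) >= structure_budget:
            break                      # structure budget reached
        if len(nodes | set(s)) > node_budget:
            continue                   # s would exceed the node budget; skip
        chosen.append(s)
        nodes |= set(s)
    return chosen, nodes

triangles = [(0, 1, 2), (2, 3, 4), (5, 6, 7), (0, 4, 8)]
chosen, nodes = sample_with_budget(triangles, structure_budget=3, node_budget=6)
print(chosen, sorted(nodes))
# → [(0, 1, 2), (2, 3, 4), (0, 4, 8)] [0, 1, 2, 3, 4, 8]
```

Here (5, 6, 7) is skipped because it shares no nodes with the batch and would push the node count to 8, while (0, 4, 8) fits since it reuses existing nodes; this overlap-friendly behavior is why the node budget matters most for structure-centric sampling.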

Tutorials

✅ tutorial_ondisk_transductive_structure_centric.ipynb

✅ tutorial_ondisk_transductive_extended_context.ipynb

Note: Due to time constraints I could not finish this PR. Most things should work, but there may be some small issues. After the challenge officially ends, I will add benchmarks and tests and thoroughly refactor the code.

grapentt · Nov 25 '25 20:11
