Category: B1 (Bonus); Team name: TG; Dataset: ogbn-products
B1 Bonus: On-Disk Transductive Learning with SQLite-Backed Structure Indexing
Problem Statement
Challenge: Training Topological Neural Networks (TNNs) on large transductive graphs (100K+ nodes) with complex topological structures (cliques, cycles, etc.) faces fundamental memory limitations:
- Structure Explosion: A graph with 100K nodes can contain millions of cliques/simplices
- Preprocessing Bottleneck: Computing and storing all topological structures requires O(N × D^k) memory, where:
- N = number of nodes
- D = average degree
- k = structure size (e.g., 3 for triangles)
- Transductive Constraint: Unlike inductive learning (separate train/test graphs), transductive learning requires sampling from a single large graph with train/val/test masks
Real-World Impact: Popular benchmarks like ogbn-products (2.4M nodes, 61M edges) are currently infeasible for TNNs due to these memory constraints.
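For a rough sense of scale (taking ogbn-products' average degree of roughly 50 and k = 3 for triangles):

N × D^k ≈ (2.4 × 10^6) × 50^3 ≈ 3 × 10^11 candidate structure entries,

which is far beyond main memory. The per-batch bound O(batch_size × D^k) used below replaces N with the node budget (e.g., 2,000 nodes), shrinking this estimate by roughly three orders of magnitude.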
Our Approach: Two-Strategy Solution
We developed two complementary strategies for memory-efficient transductive TNN training:
Strategy 1: Structure-Centric Sampling
Guarantee: 100% Structure Completeness
# Sample structures FIRST, then extract nodes
structures = sample_k_cliques(target=500) # Sample 500 cliques
nodes = union_of_nodes(structures) # Extract their nodes
batch = build_subgraph(nodes, structures) # Create batch
Key Innovation: This reverses traditional graph sampling: we sample topological structures (cliques) first, then derive the node set. This guarantees that every sampled structure is 100% complete in the batch (no missing nodes).
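As a rough sketch of what this looks like against the SQLite index described below (the cliques(id, size, node_list_json) table follows the schema in the architecture section; the function name and budget handling are simplified illustrations, not the exact StructureCentricSampler implementation):

```python
import json
import random
import sqlite3

def sample_structure_centric(db_path, target_structures=500, node_budget=2000, seed=0):
    """Illustrative structures-first sampling: draw cliques, then take the union of their nodes."""
    rng = random.Random(seed)
    con = sqlite3.connect(db_path)

    clique_ids = [row[0] for row in con.execute("SELECT id FROM cliques")]
    rng.shuffle(clique_ids)

    sampled_cliques, nodes = [], set()
    for cid in clique_ids:
        (node_list_json,) = con.execute(
            "SELECT node_list_json FROM cliques WHERE id = ?", (cid,)
        ).fetchone()
        members = json.loads(node_list_json)
        if len(nodes | set(members)) > node_budget:
            break  # node budget reached -> stop to keep memory bounded
        nodes.update(members)
        sampled_cliques.append(members)
        if len(sampled_cliques) >= target_structures:
            break  # structure budget reached
    con.close()

    # By construction every sampled clique is fully contained in `nodes`,
    # so all structures in the batch are 100% complete.
    return sorted(nodes), sampled_cliques
```

In the actual pipeline this logic lives in the StructureCentricSampler and the collate function, which additionally build the PyG subgraph batch from the returned nodes and structures.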
Strategy 2: Extended Context Sampling
Near-Complete Structures with Topology-Aware Heuristics
# Sample nodes FIRST (cluster-aware), then expand context
core_nodes = cluster_aware_sample(batch_size=1000)       # Louvain clustering
context_nodes = expand_context(core_nodes, ratio=1.5)    # +50% context nodes
all_nodes = core_nodes | context_nodes                   # Core + context node set
core_mask = [n in core_nodes for n in all_nodes]         # Loss is computed on core nodes only
structures = query_structures(all_nodes)                 # Query structures from the index
batch = build_subgraph(all_nodes, structures, core_mask)
Key Innovation: Uses graph topology (community detection) to sample dense regions, then adds context nodes to increase structure completeness. Distinguishes "core" vs "context" nodes for loss computation.
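A minimal sketch of the same idea using NetworkX's built-in Louvain communities (the function below is illustrative; the actual ClusterAwareNodeSampler and context-expansion logic in the PR may differ in details such as how the community and the context nodes are chosen):

```python
import random

import networkx as nx
from networkx.algorithms.community import louvain_communities

def sample_extended_context(G, batch_size=1000, context_ratio=1.5, seed=0):
    """Illustrative node-first sampling: core nodes from a dense community, plus context."""
    rng = random.Random(seed)

    # 1. Cluster-aware core sampling: pick the largest Louvain community and
    #    draw the core (loss-bearing) nodes from it.
    communities = louvain_communities(G, seed=seed)
    community = list(max(communities, key=len))
    core_nodes = set(rng.sample(community, min(batch_size, len(community))))

    # 2. Context expansion: add neighbours of core nodes until the batch reaches
    #    roughly context_ratio * |core_nodes| nodes (here, +50% context).
    target_size = int(context_ratio * len(core_nodes))
    all_nodes = set(core_nodes)
    frontier = [n for c in core_nodes for n in G.neighbors(c) if n not in core_nodes]
    rng.shuffle(frontier)
    for n in frontier:
        if len(all_nodes) >= target_size:
            break
        all_nodes.add(n)

    # 3. The core mask lets the training loop compute the loss on core nodes only,
    #    while context nodes serve to complete more structures.
    core_mask = {n: (n in core_nodes) for n in all_nodes}
    return all_nodes, core_mask
```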
Architecture & Workflow
Core Components
1. OnDiskTransductivePreprocessor
   - Converts PyG Data → NetworkX graph
   - Detects topological structures (cliques up to size k)
   - Builds SQLite index via CliqueQueryEngine
        ↓
2. SQLite Index (Persistent Disk Storage)
   - Table: cliques (id, size, node_list_json)
   - Table: node_to_cliques (node_id, clique_id)
   - Enables fast queries: O(log N) per structure
        ↓
3. Strategy-Specific Samplers
   - A. StructureCentricSampler
     - Samples structure IDs directly
     - Budget-aware: stops at node_budget
   - B. ClusterAwareNodeSampler
     - Uses Louvain/KMeans clustering
     - Samples from dense communities
        ↓
4. Collate Functions (On-Demand Structure Query)
   - Query structures from the SQLite index
   - Build PyG Data batch with subgraph
   - Apply topological transforms (liftings) at batch time
   - Return: Data(x, edge_index, precomputed_structures)
        ↓
5. TransductiveSplitDataset (Seamless Integration)
   - Wraps preprocessor + sampler + collate
   - Compatible with TBDataloader (batch_size=1)
   - Lazy materialization for memory efficiency
Training Workflow
# 1. Preprocessing (One-Time, Cached on Disk)
preprocessor = OnDiskTransductivePreprocessor(
    graph_data=data,       # PyG Data object
    data_dir="./index",    # SQLite storage
    max_clique_size=3      # Detect up to 3-cliques
)
preprocessor.build_index()  # Stores to disk: ./index/structures.db

# 2. Create Splits (Automatic Batching)
split_config = OmegaConf.create({
    "strategy": "structure_centric",  # or "extended_context"
    "structures_per_batch": 500,      # Structure budget
    "node_budget": 2000,              # Node budget (prevents explosion)
    "shuffle": True
})
train, val, test = preprocessor.load_dataset_splits(split_config)

# 3. Training (Standard PyTorch Lightning)
datamodule = TBDataloader(train, val, test, batch_size=1)  # batch_size=1!
trainer.fit(model, datamodule)  # Batches pre-built, just iterate

# Memory: O(batch_size × D^k) instead of O(N × D^k)
Key Innovations
1. SQLite-Backed Structure Index
- Persistent storage: Index computed once, reused across runs
- Fast queries: B-tree indexing enables O(log N) structure lookups
- Scalable: Handles millions of structures without memory overhead
- Node-to-structure mapping: Bidirectional queries for both strategies
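For concreteness, here is a sketch of what the two tables and their indexes could look like (column names are taken from the description above; the exact schema created by CliqueQueryEngine may differ):

```python
import sqlite3

# Illustrative schema matching the tables described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cliques (
    id             INTEGER PRIMARY KEY,
    size           INTEGER NOT NULL,
    node_list_json TEXT    NOT NULL  -- JSON-encoded list of member node ids
);
CREATE TABLE IF NOT EXISTS node_to_cliques (
    node_id   INTEGER NOT NULL,
    clique_id INTEGER NOT NULL REFERENCES cliques(id)
);
-- B-tree indexes give O(log N) lookups in both directions.
CREATE INDEX IF NOT EXISTS idx_node_to_cliques_node   ON node_to_cliques(node_id);
CREATE INDEX IF NOT EXISTS idx_node_to_cliques_clique ON node_to_cliques(clique_id);
"""

con = sqlite3.connect("./index/structures.db")
con.executescript(SCHEMA)

# Bidirectional queries used by the two samplers:
#   structure -> nodes: SELECT node_list_json FROM cliques WHERE id = ?
#   node -> structures: SELECT clique_id FROM node_to_cliques WHERE node_id = ?
```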
2. Dual Sampling Strategies
- Structure-Centric: Novel "structures-first" paradigm guarantees completeness
- Extended Context: Topology-aware node sampling with context expansion
- Trade-off selection: Users choose based on their priority (completeness vs. coverage)
3. Batch-Time Transform Application
- Deferred lifting: Topological transforms (SimplicialCliqueLifting, etc.) applied during batch collation
- Memory efficiency: Only lift mini-batch subgraphs, not entire graph
- Flexibility: Each batch gets fresh transforms (supports data augmentation)
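A condensed sketch of that collate step (subgraph extraction uses standard PyTorch Geometric utilities; lifting_transform is a placeholder for a TopoBench lifting such as SimplicialCliqueLifting, and the real collate functions additionally attach the structures queried from SQLite):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def collate_with_batch_time_lifting(full_data, node_ids, lifting_transform=None):
    """Build a mini-batch subgraph, then apply the topological lifting to it only."""
    node_ids = torch.as_tensor(sorted(node_ids), dtype=torch.long)

    # Induced subgraph with nodes relabelled to 0..len(node_ids)-1.
    edge_index, _ = subgraph(node_ids, full_data.edge_index,
                             relabel_nodes=True, num_nodes=full_data.num_nodes)
    batch = Data(x=full_data.x[node_ids], edge_index=edge_index,
                 y=full_data.y[node_ids], n_id=node_ids)

    # Deferred lifting: only the mini-batch subgraph is lifted, keeping memory
    # at O(batch_size * D^k) instead of O(N * D^k).
    if lifting_transform is not None:
        batch = lifting_transform(batch)
    return batch
```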
4. Seamless Integration with Existing Pipeline
# Works exactly like inductive learning!
train, val, test = preprocessor.load_dataset_splits(config)
datamodule = TBDataloader(train, val, test, batch_size=1)
trainer.fit(model, datamodule)
- Zero API changes for users
- Compatible with all existing models, transforms, and training loops
- Drop-in replacement for standard preprocessing
5. Budget-Aware Sampling
- Node budget: Prevents batch size explosion (critical for structure-centric)
- Structure budget: Controls number of structures per batch
- Adaptive: The sampler stops as soon as a budget is reached, ensuring consistent memory usage
Tutorials
- tutorial_ondisk_transductive_structure_centric.ipynb
- tutorial_ondisk_transductive_extended_context.ipynb
Note: Due to time constraints I could not fully finish this PR. Most things should work, but there may still be some small issues. After the challenge is officially over, I plan to add benchmarks and tests and thoroughly refactor the code.