
Category: B1 (Bonus); Team name: TG; Dataset: ogbn-products

grapentt opened this issue 5 months ago · 1 comment

🚀 B1 Bonus: On-Disk Transductive Learning with SQLite-Backed Structure Indexing

📌 Problem Statement

Challenge: Training Topological Neural Networks (TNNs) on large transductive graphs (100K+ nodes) with complex topological structures (cliques, cycles, etc.) faces fundamental memory limitations:

  1. Structure Explosion: A graph with 100K nodes can contain millions of cliques/simplices
  2. Preprocessing Bottleneck: Computing and storing all topological structures requires O(N × D^k) memory where:
    • N = number of nodes
    • D = average degree
    • k = structure size (e.g., 3 for triangles)
  3. Transductive Constraint: Unlike inductive learning (separate train/test graphs), transductive learning requires sampling from a single large graph with train/val/test masks

Real-World Impact: Popular benchmarks like ogbn-products (2.4M nodes, 61M edges) are currently infeasible for TNNs due to these memory constraints.
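To make the memory bound concrete, here is an illustrative back-of-envelope calculation for ogbn-products using the node and edge counts above (the average degree and bytes-per-structure figures are assumptions for illustration, not measurements):

```python
# Illustrative worst-case estimate of the O(N * D^k) structure bound
# for ogbn-products (N and E from the problem statement; D approximated
# as the average degree, storage cost assumed at 8 bytes per node id).
N = 2_400_000          # nodes
E = 61_000_000         # edges
D = 2 * E / N          # average degree (~51)
k = 3                  # structure size (triangles)

upper_bound = N * D ** k           # worst-case structure count
bytes_per_structure = k * 8        # k node ids at 8 bytes each (assumption)
print(f"~{upper_bound:.2e} structures, "
      f"~{upper_bound * bytes_per_structure / 1e12:.1f} TB worst case")
```

Even as a loose upper bound, this is orders of magnitude beyond what fits in RAM, which motivates pushing the structure index to disk.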


💡 Our Approach: Two-Strategy Solution

We developed two complementary strategies for memory-efficient transductive TNN training:

Strategy 1: Structure-Centric Sampling 🎯

Guarantee: 100% Structure Completeness

# Sample structures FIRST, then extract nodes
structures = sample_k_cliques(target=500)  # Sample 500 cliques
nodes = union_of_nodes(structures)          # Extract their nodes
batch = build_subgraph(nodes, structures)   # Create batch

Key Innovation: Reverses traditional graph sampling: we sample topological structures (cliques) first, then derive the node set. This guarantees that every sampled structure is 100% complete in the batch (no missing nodes).
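The structures-first idea can be sketched end to end on a toy random graph. This is a self-contained, stdlib-only illustration: in the real pipeline the triangles would come from the SQLite index rather than brute-force enumeration, and the helper names are mine, not the library's.

```python
# Hypothetical sketch of structure-centric sampling on a toy random graph:
# sample structures (triangles) FIRST, then derive the node set, so every
# sampled structure is fully contained in the batch.
import itertools
import random

rng = random.Random(0)
n = 60
edges = {frozenset(e) for e in itertools.combinations(range(n), 2)
         if rng.random() < 0.2}

def is_triangle(a, b, c):
    return all(frozenset(p) in edges for p in ((a, b), (b, c), (a, c)))

# 1. Enumerate 3-cliques (the real pipeline reads these from the SQLite index).
triangles = [t for t in itertools.combinations(range(n), 3) if is_triangle(*t)]

# 2. Sample structures FIRST (structure budget of 20).
sampled = rng.sample(triangles, min(20, len(triangles)))

# 3. Derive the node set and induce the subgraph batch.
nodes = set().union(*map(set, sampled))
batch_edges = {e for e in edges if e <= nodes}

# Every sampled triangle is 100% complete in the batch by construction.
complete = all(frozenset(p) in batch_edges
               for t in sampled for p in itertools.combinations(t, 2))
print(complete)  # → True
```

The completeness guarantee falls out of the ordering: the node set is defined as the union of the sampled structures, so no structure can be missing a node.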

Strategy 2: Extended Context Sampling 🌐

Near-Complete Structures with Topology-Aware Heuristics

# Sample nodes FIRST (cluster-aware), then expand context
core_nodes = cluster_aware_sample(batch_size=1000)     # Louvain clustering
context_nodes = expand_context(core_nodes, ratio=1.5)  # +50% context
all_nodes = core_nodes | context_nodes                 # Core + context
structures = query_structures(all_nodes)               # Query structures
batch = build_subgraph(all_nodes, structures, core_mask)

Key Innovation: Uses graph topology (community detection) to sample dense regions, then adds context nodes to increase structure completeness. Distinguishes "core" vs "context" nodes for loss computation.
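A minimal stdlib-only sketch of this strategy is below. BFS from a high-degree seed stands in for Louvain community detection (which the real sampler uses); the point is the shape of the pipeline: pick dense "core" nodes, add neighboring "context" nodes to raise structure completeness, and keep a core mask so the loss is computed on core nodes only.

```python
# Hypothetical sketch of extended-context sampling. BFS from a dense seed
# is a stand-in for cluster-aware (Louvain) sampling.
import collections
import itertools
import random

rng = random.Random(1)
n = 60
adj = collections.defaultdict(set)
for a, b in itertools.combinations(range(n), 2):
    if rng.random() < 0.1:
        adj[a].add(b)
        adj[b].add(a)

# 1. Core nodes: BFS region around the highest-degree seed.
seed = max(range(n), key=lambda v: len(adj[v]))
core, frontier = set(), collections.deque([seed])
while frontier and len(core) < 15:
    u = frontier.popleft()
    if u not in core:
        core.add(u)
        frontier.extend(adj[u])

# 2. Context nodes: extra neighbors of the core (here up to +50%, ratio=1.5).
candidates = sorted(set().union(*(adj[u] for u in core)) - core)
context = set(rng.sample(candidates, min(len(candidates), len(core) // 2)))

# 3. Core mask distinguishes loss nodes from context-only nodes.
all_nodes = core | context
core_mask = {v: v in core for v in sorted(all_nodes)}
print(len(core), len(context))
```

Unlike Strategy 1, structures touching the boundary of the sampled region may still be incomplete; the context expansion only makes that less likely.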


πŸ—οΈ Architecture & Workflow

Core Components

┌─────────────────────────────────────────────────────────────┐
│  1. OnDiskTransductivePreprocessor                          │
│     - Converts PyG Data → NetworkX graph                    │
│     - Detects topological structures (cliques up to size k) │
│     - Builds SQLite index via CliqueQueryEngine             │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  2. SQLite Index (Persistent Disk Storage)                  │
│     - Table: cliques (id, size, node_list_json)             │
│     - Table: node_to_cliques (node_id, clique_id)           │
│     - Enables fast queries: O(log N) per structure          │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  3. Strategy-Specific Samplers                              │
│     A. StructureCentricSampler                              │
│        - Samples structure IDs directly                     │
│        - Budget-aware: stops at node_budget                 │
│     B. ClusterAwareNodeSampler                              │
│        - Uses Louvain/KMeans clustering                     │
│        - Samples from dense communities                     │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  4. Collate Functions (On-Demand Structure Query)           │
│     - Query structures from SQLite index                    │
│     - Build PyG Data batch with subgraph                    │
│     - Apply topological transforms (liftings) at batch time │
│     - Return: Data(x, edge_index, precomputed_structures)   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  5. TransductiveSplitDataset (Seamless Integration)         │
│     - Wraps preprocessor + sampler + collate                │
│     - Compatible with TBDataloader (batch_size=1)           │
│     - Lazy materialization for memory efficiency            │
└─────────────────────────────────────────────────────────────┘
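The SQLite index at the heart of the pipeline can be sketched with the standard-library `sqlite3` module. The table and column names below follow the diagram; the real CliqueQueryEngine may organize its schema differently.

```python
# Minimal sketch of the SQLite structure index described above.
import json
import sqlite3

con = sqlite3.connect(":memory:")  # on disk this would be ./index/structures.db
con.executescript("""
    CREATE TABLE cliques (
        id INTEGER PRIMARY KEY,
        size INTEGER NOT NULL,
        node_list_json TEXT NOT NULL
    );
    CREATE TABLE node_to_cliques (
        node_id INTEGER NOT NULL,
        clique_id INTEGER NOT NULL REFERENCES cliques(id)
    );
    -- B-tree index enabling O(log N) node -> structure lookups.
    CREATE INDEX idx_node ON node_to_cliques(node_id);
""")

# Insert one triangle and its bidirectional node mapping.
con.execute("INSERT INTO cliques VALUES (1, 3, ?)", (json.dumps([4, 7, 9]),))
con.executemany("INSERT INTO node_to_cliques VALUES (?, 1)", [(4,), (7,), (9,)])

# Query: all cliques containing node 7 (used by extended-context collation).
rows = con.execute(
    "SELECT c.id, c.node_list_json FROM cliques c "
    "JOIN node_to_cliques m ON m.clique_id = c.id WHERE m.node_id = ?", (7,)
).fetchall()
print(rows)  # → [(1, '[4, 7, 9]')]
```

Because the index lives on disk and is queried per batch, the full structure set never has to be materialized in memory.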

Training Workflow

# 1. Preprocessing (One-Time, Cached on Disk)
preprocessor = OnDiskTransductivePreprocessor(
    graph_data=data,           # PyG Data object
    data_dir="./index",        # SQLite storage
    max_clique_size=3          # Detect up to 3-cliques
)
preprocessor.build_index()     # Stores to disk: ./index/structures.db

# 2. Create Splits (Automatic Batching)
split_config = OmegaConf.create({
    "strategy": "structure_centric",  # or "extended_context"
    "structures_per_batch": 500,      # Structure budget
    "node_budget": 2000,               # Node budget (prevents explosion)
    "shuffle": True
})
train, val, test = preprocessor.load_dataset_splits(split_config)

# 3. Training (Standard PyTorch Lightning)
datamodule = TBDataloader(train, val, test, batch_size=1)  # batch_size=1!
trainer.fit(model, datamodule)  # Batches pre-built, just iterate

# Memory: O(batch_size × D^k) instead of O(N × D^k) ✅

✨ Key Innovations

1. SQLite-Backed Structure Index

  • Persistent storage: Index computed once, reused across runs
  • Fast queries: B-tree indexing enables O(log N) structure lookups
  • Scalable: Handles millions of structures without memory overhead
  • Node-to-structure mapping: Bidirectional queries for both strategies

2. Dual Sampling Strategies

  • Structure-Centric: Novel "structures-first" paradigm guarantees completeness
  • Extended Context: Topology-aware node sampling with context expansion
  • Trade-off selection: Users choose based on their priority (completeness vs. coverage)

3. Batch-Time Transform Application

  • Deferred lifting: Topological transforms (SimplicialCliqueLifting, etc.) applied during batch collation
  • Memory efficiency: Only lift mini-batch subgraphs, not entire graph
  • Flexibility: Each batch gets fresh transforms (supports data augmentation)
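Deferred lifting can be illustrated with a toy collate function. The `lift` helper here is a stand-in for a real transform such as SimplicialCliqueLifting; the key point is that it runs inside collation on the mini-batch subgraph only, never on the full graph.

```python
# Sketch of batch-time ("deferred") lifting: the topological transform runs
# in the collate function on the induced subgraph of the sampled nodes.
import itertools

def lift(nodes, edges):
    """Toy clique lifting: return the triangles of the subgraph."""
    eset = set(map(frozenset, edges))
    return [t for t in itertools.combinations(sorted(nodes), 3)
            if all(frozenset(p) in eset for p in itertools.combinations(t, 2))]

def collate(nodes, all_edges):
    # 1. Induce the subgraph for this batch.
    sub_edges = [e for e in all_edges if set(e) <= set(nodes)]
    # 2. Apply the lifting to the SUBGRAPH only: memory O(batch_size * D^k).
    simplices = lift(nodes, sub_edges)
    return {"nodes": nodes, "edge_index": sub_edges, "structures": simplices}

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
batch = collate([0, 1, 2, 3], edges)
print(batch["structures"])  # → [(0, 1, 2)]
```

Note that edge (3, 4) is dropped because node 4 is outside the batch, and only the one triangle fully inside the batch is lifted.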

4. Seamless Integration with Existing Pipeline

# Works exactly like inductive learning!
train, val, test = preprocessor.load_dataset_splits(config)
datamodule = TBDataloader(train, val, test, batch_size=1)
trainer.fit(model, datamodule)
  • Zero API changes for users
  • Compatible with all existing models, transforms, and training loops
  • Drop-in replacement for standard preprocessing

5. Budget-Aware Sampling

  • Node budget: Prevents batch size explosion (critical for structure-centric)
  • Structure budget: Controls number of structures per batch
  • Adaptive: The sampler stops once a budget is reached, ensuring consistent memory usage per batch
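The interaction of the two budgets can be sketched as a simple greedy loop (function and variable names are illustrative, not the library's API): stop once the structure budget is hit, and skip any structure whose nodes would push the batch past the node budget.

```python
# Sketch of budget-aware sampling with a structure budget and a node budget.
def sample_with_budget(structures, structure_budget, node_budget):
    chosen, nodes = [], set()
    for s in structures:
        if len(chosen) >= structure_budget:
            break                      # structure budget reached
        if len(nodes | set(s)) > node_budget:
            continue                   # s would exceed the node budget; skip
        chosen.append(s)
        nodes |= set(s)
    return chosen, nodes

triangles = [(0, 1, 2), (2, 3, 4), (5, 6, 7), (0, 4, 8)]
chosen, nodes = sample_with_budget(triangles, structure_budget=3, node_budget=6)
print(chosen, sorted(nodes))
# → [(0, 1, 2), (2, 3, 4), (0, 4, 8)] [0, 1, 2, 3, 4, 8]
```

Here (5, 6, 7) is skipped because it shares no nodes with the batch and would push the node count to 8, while (0, 4, 8) fits since it reuses existing nodes; this overlap-friendly behavior is why the node budget matters most for structure-centric sampling.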

Tutorials

✅ tutorial_ondisk_transductive_structure_centric.ipynb

✅ tutorial_ondisk_transductive_extended_context.ipynb

Note: Due to time constraints I could not finish this PR. Most things should work, but there may be some small issues. After the challenge officially ends, I will add benchmarks and tests and thoroughly refactor the code.

grapentt · Nov 25 '25 20:11
