Category: A2; Team name: HugoWalter; Dataset: Cornell Labeled Nodes Hypergraphs

Open • ziraax opened this issue 6 months ago • 0 comments

Checklist

[x] My pull request has a clear and explanatory title.
[x] My pull request passes the Linting test.
[x] I added appropriate unit tests and I made sure the code passes all unit tests.
[x] My PR follows PEP8 guidelines.
[x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
[x] I linked to issues and PRs that are relevant to this PR.

Description

This PR implements support for 8 single-label Cornell hypergraph datasets for TDL Challenge Category A.2. The implementation includes:

Dataset class: CornellLabeledNodesDataset with automatic Google Drive download, parsing of Cornell text format, 0-indexing, and node deduplication
Loader: CornellLabeledNodesDatasetLoader for seamless TopoBench pipeline integration
Refactored parsing logic: Shared parse_cornell_hypergraph_files() utility to eliminate code duplication
Configuration: 8 Hydra config files with detailed dataset descriptions and statistics
Comprehensive testing: 72 tests (8 datasets x 9 test functions) with session-scoped caching for efficient handling of large datasets (amazon-reviews: 2.2M nodes)
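The parsing steps listed above (0-indexing and node deduplication) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `parse_cornell_hypergraph_files()` code: the helper names `parse_hyperedge_lines` and `parse_label_lines` are hypothetical, and the sketch assumes one hyperedge per line as comma-separated 1-indexed node IDs, one 1-indexed label per line, and that deduplication means dropping repeated nodes within a hyperedge.

```python
from typing import Iterable


def parse_hyperedge_lines(lines: Iterable[str]) -> list[list[int]]:
    """Parse hyperedge lines into 0-indexed, deduplicated node lists.

    Hypothetical helper: assumes one hyperedge per line, written as
    comma-separated 1-indexed node IDs.
    """
    hyperedges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seen = set()
        nodes = []
        for token in line.split(","):
            node = int(token) - 1  # shift 1-indexed IDs to 0-indexed
            if node not in seen:  # drop repeated nodes, keep first-seen order
                seen.add(node)
                nodes.append(node)
        hyperedges.append(nodes)
    return hyperedges


def parse_label_lines(lines: Iterable[str]) -> list[int]:
    """Parse one 1-indexed class label per line into 0-indexed labels."""
    return [int(line) - 1 for line in lines if line.strip()]


# Tiny in-memory example standing in for the downloaded text files.
raw_edges = ["1,2,3", "2,2,4", "", "4,1"]
raw_labels = ["1", "2", "1", "2"]
edges = parse_hyperedge_lines(raw_edges)
labels = parse_label_lines(raw_labels)
print(edges)   # [[0, 1, 2], [1, 3], [3, 0]]
print(labels)  # [0, 1, 0, 1]
```

Keeping the parser as a pure function over lines, separate from the download step, is one way a single utility could be shared across all 8 datasets.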

Datasets added:

walmart-trips (88,860 nodes, 11 classes)
house-committees (1,290 nodes, 2 classes)
senate-committees (282 nodes, 2 classes)
house-bills (1,494 nodes, 2 classes)
senate-bills (294 nodes, 2 classes)
contact-primary-school (242 nodes, 11 classes)
contact-high-school (327 nodes, 9 classes)
amazon-reviews (2.2M nodes, 29 classes)

Time/Space complexity: Incidence matrices are stored as sparse COO tensors, so memory scales with the number of node-hyperedge incidences (nonzero entries) rather than with nodes × hyperedges.
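To illustrate the COO layout, here is a small pure-Python sketch that builds the coordinate lists of a node-by-hyperedge incidence matrix. The function name `incidence_coo` and the toy hyperedges are assumptions made for illustration; the PR's dataset class presumably builds the equivalent tensor directly in PyTorch.

```python
def incidence_coo(hyperedges, num_nodes):
    """Return COO coordinates of the node-by-hyperedge incidence matrix.

    Hypothetical helper: one (row, col) pair is stored per node-hyperedge
    membership, so storage grows with the number of incidences only.
    """
    rows, cols = [], []
    for edge_idx, nodes in enumerate(hyperedges):
        for node in nodes:
            if not 0 <= node < num_nodes:
                raise ValueError(f"node {node} out of range")
            rows.append(node)      # row index = node ID
            cols.append(edge_idx)  # column index = hyperedge ID
    values = [1.0] * len(rows)     # every stored entry is a 1
    shape = (num_nodes, len(hyperedges))
    return rows, cols, values, shape


hyperedges = [[0, 1, 2], [1, 3], [3, 0]]
rows, cols, values, shape = incidence_coo(hyperedges, num_nodes=4)
nnz = len(values)
dense_entries = shape[0] * shape[1]
print(rows)  # [0, 1, 2, 1, 3, 3, 0]
print(cols)  # [0, 0, 0, 1, 1, 2, 2]
print(nnz, dense_entries)  # 7 12
```

In PyTorch, these lists could be passed as `torch.sparse_coo_tensor([rows, cols], values, shape)`. The gap between stored and dense entries is small in this toy case but becomes decisive at the scale of amazon-reviews.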

Issue

This PR addresses TDL Challenge Category A.2 requirements for adding real-world hypergraph datasets. These datasets enable:

Community detection research (political affiliation, product departments)
Social network analysis (face-to-face contact patterns)
E-commerce behavior modeling (shopping patterns, product co-purchases)

The implementation provides single-label node classification benchmarks spanning small (242 nodes) to very large (2.2M nodes) hypergraphs across diverse domains.

Not implemented:

stackoverflow-answers and mathoverflow-answers: These are multi-label datasets (nodes can have multiple tags) and require a different architecture. Deferred to a separate PR.
trivago-clicks: Statistics computed from the downloaded data do not match those listed on the Cornell website. Requires verification before implementation.

Additional context

Pipeline test verified (pytest test/pipeline/test_pipeline.py)
All 72 tests passing (pytest test/data/load/test_cornell_labeled_nodes.py)
All code passes ruff linting and numpydoc validation
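For readers unfamiliar with the pattern, the 8 x 9 test matrix with session-scoped caching might look roughly like the sketch below. It is not the contents of test_cornell_labeled_nodes.py: `load_once`, the placeholder payload, and the single example test are assumptions, and only the dataset names come from this PR.

```python
import pytest

DATASET_NAMES = [
    "walmart-trips",
    "house-committees",
    "senate-committees",
    "house-bills",
    "senate-bills",
    "contact-primary-school",
    "contact-high-school",
    "amazon-reviews",
]

_CACHE = {}
LOAD_CALLS = {"count": 0}


def load_once(name):
    """Build a dataset at most once per process (placeholder loader)."""
    if name not in _CACHE:
        LOAD_CALLS["count"] += 1
        # Placeholder payload; a real test would download and parse here.
        _CACHE[name] = {"name": name}
    return _CACHE[name]


@pytest.fixture(scope="session", params=DATASET_NAMES)
def dataset(request):
    """Session-scoped, parametrized fixture shared by every test function."""
    return load_once(request.param)


def test_has_name(dataset):
    # Each test function that takes `dataset` runs once per dataset name,
    # so 9 such functions would yield 8 x 9 = 72 test cases.
    assert dataset["name"] in DATASET_NAMES
```

The explicit `_CACHE` is redundant with the fixture's session scope under pytest; it is included only to make the load-once behavior visible outside a test run.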

Nov 12 '25 00:11 ziraax