TopoBench
Category: A2; Team name: HugoWalter; Dataset: Cornell Labeled Nodes Hypergraphs
Checklist
- [x] My pull request has a clear and explanatory title.
- [x] My pull request passes the Linting test.
- [x] I added appropriate unit tests and I made sure the code passes all unit tests.
- [x] My PR follows PEP8 guidelines.
- [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
- [x] I linked to issues and PRs that are relevant to this PR.
Description
This PR implements support for 8 single-label Cornell hypergraph datasets for TDL Challenge Category A.2. The implementation includes:
- Dataset class: `CornellLabeledNodesDataset` with automatic Google Drive download, parsing of Cornell text format, 0-indexing, and node deduplication
- Loader: `CornellLabeledNodesDatasetLoader` for seamless TopoBench pipeline integration
- Refactored parsing logic: Shared `parse_cornell_hypergraph_files()` utility to eliminate code duplication
- Configuration: 8 Hydra config files with detailed dataset descriptions and statistics
- Comprehensive testing: 72 tests (8 datasets x 9 test functions) with session-scoped caching for efficient handling of large datasets (amazon-reviews: 2.2M nodes)
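As a rough illustration of the parsing, 0-indexing, and deduplication steps listed above, here is a minimal sketch. It is not the PR's `parse_cornell_hypergraph_files()`; the function name `parse_hyperedges` is hypothetical. It assumes each line of the hyperedge file lists the 1-indexed node IDs of one hyperedge, separated by commas, and that deduplication means dropping repeated nodes within a hyperedge.

```python
from typing import Iterable


def parse_hyperedges(lines: Iterable[str]) -> tuple[list[int], list[int], int]:
    """Turn 1-indexed, comma-separated hyperedge lines into COO index lists.

    Hypothetical helper for illustration only. Returns
    (node_rows, edge_cols, num_hyperedges), where entry k means that node
    node_rows[k] belongs to hyperedge edge_cols[k].
    """
    node_rows: list[int] = []
    edge_cols: list[int] = []
    edge_id = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Shift to 0-indexing; dict.fromkeys drops repeated nodes while
        # keeping first-seen order (assumed meaning of "deduplication").
        nodes = list(dict.fromkeys(int(tok) - 1 for tok in line.split(",")))
        node_rows.extend(nodes)
        edge_cols.extend([edge_id] * len(nodes))
        edge_id += 1
    return node_rows, edge_cols, edge_id


# Toy input: two hyperedges, the second with a repeated node, plus a blank line.
rows, cols, num_edges = parse_hyperedges(["1,2,3", "3,3,4", ""])
print(rows)       # [0, 1, 2, 2, 3]
print(cols)       # [0, 0, 0, 1, 1]
print(num_edges)  # 2
```

The two index lists are in the row/column form that a sparse COO incidence matrix expects, so they could be handed to a sparse tensor constructor without building a dense matrix first.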
Datasets added:
- walmart-trips (88,860 nodes, 11 classes)
- house-committees (1,290 nodes, 2 classes)
- senate-committees (282 nodes, 2 classes)
- house-bills (1,494 nodes, 2 classes)
- senate-bills (294 nodes, 2 classes)
- contact-primary-school (242 nodes, 11 classes)
- contact-high-school (327 nodes, 9 classes)
- amazon-reviews (2.2M nodes, 29 classes)
Time/Space complexity: Datasets use sparse COO tensors for memory-efficient incidence matrix representation.
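To make the memory argument concrete, the following back-of-the-envelope sketch compares dense and COO storage for an incidence matrix. The helper names and the toy sizes are hypothetical, and the byte counts assume 64-bit indices and 32-bit values; actual overhead depends on the tensor library and dtype used.

```python
def dense_bytes(num_nodes: int, num_edges: int, value_bytes: int = 4) -> int:
    """Bytes for a dense num_nodes x num_edges incidence matrix."""
    return num_nodes * num_edges * value_bytes


def coo_bytes(nnz: int, index_bytes: int = 8, value_bytes: int = 4) -> int:
    """Bytes for a COO matrix: one (row, col) index pair plus one value per entry."""
    return nnz * (2 * index_bytes + value_bytes)


# Illustrative sizes only, not statistics of any dataset in this PR.
num_nodes, num_edges, nnz = 1000, 500, 5000
print(dense_bytes(num_nodes, num_edges))  # 2000000
print(coo_bytes(nnz))                     # 100000
```

Dense storage grows with nodes × hyperedges, while COO storage grows only with the number of node-hyperedge memberships, which is what makes the larger datasets tractable.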
Issue
This PR addresses TDL Challenge Category A.2 requirements for adding real-world hypergraph datasets. These datasets enable:
- Community detection research (political affiliation, product departments)
- Social network analysis (face-to-face contact patterns)
- E-commerce behavior modeling (shopping patterns, product co-purchases)
The implementation provides single-label node classification benchmarks spanning small (242 nodes) to very large (2.2M nodes) hypergraphs across diverse domains.
Not implemented:
- stackoverflow-answers and mathoverflow-answers: These are multi-label datasets (nodes can have multiple tags) and require a different architecture. Deferred to a separate PR.
- trivago-clicks: Statistics from downloaded data do not match Cornell website specifications. Requires verification before implementation.
Additional context
- Pipeline test verified (`pytest test/pipeline/test_pipeline.py`)
- All 72 tests passing (`pytest test/data/load/test_cornell_labeled_nodes.py`)
- All code passes ruff linting and numpydoc validation