Category: A2; Team name: HugoWalter; Dataset: Cornell Labeled Nodes Hypergraphs

Open ziraax opened this issue 6 months ago • 0 comments

Checklist

  • [x] My pull request has a clear and explanatory title.
  • [x] My pull request passes the Linting test.
  • [x] I added appropriate unit tests and I made sure the code passes all unit tests.
  • [x] My PR follows PEP8 guidelines.
  • [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • [x] I linked to issues and PRs that are relevant to this PR.

Description

This PR implements support for 8 single-label Cornell hypergraph datasets for TDL Challenge Category A.2. The implementation includes:

  • Dataset class: CornellLabeledNodesDataset with automatic Google Drive download, parsing of the Cornell text format, conversion to 0-based indexing, and node deduplication
  • Loader: CornellLabeledNodesDatasetLoader for seamless TopoBench pipeline integration
  • Refactored parsing logic: Shared parse_cornell_hypergraph_files() utility to eliminate code duplication
  • Configuration: 8 Hydra config files with detailed dataset descriptions and statistics
  • Comprehensive testing: 72 tests (8 datasets x 9 test functions) with session-scoped caching for efficient handling of large datasets (amazon-reviews: 2.2M nodes)
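For readers unfamiliar with the format, the parsing step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes each line of a hyperedge file lists comma-separated, 1-based node ids, and it reads "node deduplication" as dropping repeated nodes within a single hyperedge. The function name `parse_hyperedge_lines` is hypothetical and is not the `parse_cornell_hypergraph_files()` utility itself.

```python
def parse_hyperedge_lines(lines):
    """Turn lines of comma-separated, 1-based node ids into 0-based hyperedges.

    Hypothetical helper for illustration only; the assumed file layout is
    one hyperedge per line, node ids separated by commas.
    """
    hyperedges = []
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, e.g. a trailing newline at end of file.
            continue
        # Shift ids from 1-based to 0-based.
        nodes = [int(token) - 1 for token in line.split(",")]
        # dict.fromkeys drops repeated nodes while keeping first-seen order.
        hyperedges.append(list(dict.fromkeys(nodes)))
    return hyperedges


example = parse_hyperedge_lines(["1,2,3", "2,2,4", ""])
```

Here `example` holds two hyperedges; the repeated node in the second line is kept only once.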

Datasets added:

  • walmart-trips (88,860 nodes, 11 classes)
  • house-committees (1,290 nodes, 2 classes)
  • senate-committees (282 nodes, 2 classes)
  • house-bills (1,494 nodes, 2 classes)
  • senate-bills (294 nodes, 2 classes)
  • contact-primary-school (242 nodes, 11 classes)
  • contact-high-school (327 nodes, 9 classes)
  • amazon-reviews (2.2M nodes, 29 classes)

Time/Space complexity: Datasets store the incidence matrix as a sparse COO tensor, so memory grows with the number of node-hyperedge memberships rather than with (number of nodes × number of hyperedges).
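As a rough sketch of what the COO layout means here (assuming 0-based hyperedges given as lists of node ids; `incidence_coo` is a hypothetical name, not a function from this PR), only the nonzero (node, hyperedge) pairs are stored:

```python
def incidence_coo(hyperedges):
    """Build COO coordinates of a node-by-hyperedge incidence matrix.

    Hypothetical helper for illustration; `hyperedges` is a list of lists
    of 0-based node ids. Returns (rows, cols, values, shape).
    """
    rows, cols = [], []
    for edge_id, nodes in enumerate(hyperedges):
        for node in nodes:
            rows.append(node)     # row index = node id
            cols.append(edge_id)  # column index = hyperedge id
    values = [1.0] * len(rows)    # one entry per membership
    num_nodes = max(rows) + 1 if rows else 0
    shape = (num_nodes, len(hyperedges))
    return rows, cols, values, shape


rows, cols, values, shape = incidence_coo([[0, 1, 2], [1, 3]])
```

In a PyTorch setting, coordinates like these could be passed to `torch.sparse_coo_tensor([rows, cols], values, shape)`. For the toy input above, 5 entries are stored instead of the 8 cells of the dense 4 × 2 matrix.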

Issue

This PR addresses TDL Challenge Category A.2 requirements for adding real-world hypergraph datasets. These datasets enable:

  • Community detection research (political affiliation, product departments)
  • Social network analysis (face-to-face contact patterns)
  • E-commerce behavior modeling (shopping patterns, product co-purchases)

The implementation provides single-label node classification benchmarks spanning small (242 nodes) to very large (2.2M nodes) hypergraphs across diverse domains.

Not implemented:

  • stackoverflow-answers and mathoverflow-answers: These are multi-label datasets (nodes can have multiple tags) and require a different architecture. Deferred to a separate PR.
  • trivago-clicks: Statistics from downloaded data do not match Cornell website specifications. Requires verification before implementation.

Additional context

  • Pipeline test passes (pytest test/pipeline/test_pipeline.py)
  • All 72 tests passing (pytest test/data/load/test_cornell_labeled_nodes.py)
  • All code passes ruff linting and numpydoc validation

ziraax, Nov 12 '25 00:11