Category: A1; Team name: Amiiiza; Dataset: ATLAS Top Tagging

Open amiiiza opened this issue 2 months ago • 3 comments

Checklist

  • [x] My pull request has a clear and explanatory title.
  • [x] My pull request passes the Linting test.
  • [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • [x] My PR follows PEP8 guidelines. (refer to comment below)
  • [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • [x] I linked to issues and PRs that are relevant to this PR.

Description

This PR adds the ATLAS Top Tagging dataset from CERN Open Data to TopoBench’s pointcloud domain.
It introduces a full dataset implementation, loader, and configuration needed to use this dataset for binary top-quark jet tagging.

Main changes:

  • Added ATLASTopTaggingDataset with:
    • Support for constituent-level 4-vectors (pt, eta, phi, energy) for up to 200 particles per jet.
    • Optional high-level jet features (15 variables: mass, τ ratios, ECFs, etc.).
    • Configurable options for:
      • split (train/val/test),
      • subset fraction for fast experimentation,
      • max_constituents,
      • toggling high-level features.
  • Implemented ATLASTopTaggingDatasetLoader in the pointcloud data domain.
  • Implemented a preprocessing pipeline that:
    • Downloads the raw files from the CERN Open Data portal.
    • Handles both compressed .h5.gz and uncompressed .h5 input, with a fallback when .gz files are missing.
    • Saves preprocessed data to a reusable .pt file.
  • Added a stats() helper to summarize:
    • number of jets,
    • class distribution (signal/background),
    • average number of constituents per jet.
  • Added Hydra/OmegaConf configuration files to register the dataset and loader within the existing TopoBench experiment setup.
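As a rough illustration of the summary described for stats() above, the sketch below computes the same three quantities on toy data. It is not the code from this PR: the `Jet` container, the `summarize` function, and the convention that label 1 means signal and 0 means background are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Jet:
    """Hypothetical container for one jet (not the PR's data structure)."""

    constituents: list  # list of (pt, eta, phi, energy) tuples
    label: int  # assumed convention: 1 = signal (top), 0 = background


def summarize(jets):
    """Return jet count, class counts, and mean constituents per jet."""
    counts = Counter(jet.label for jet in jets)
    n_jets = len(jets)
    total_constituents = sum(len(jet.constituents) for jet in jets)
    return {
        "num_jets": n_jets,
        "num_signal": counts.get(1, 0),
        "num_background": counts.get(0, 0),
        # Guard against division by zero for an empty dataset.
        "avg_constituents": total_constituents / n_jets if n_jets else 0.0,
    }


# Toy data: three jets with 2, 1, and 3 constituents.
jets = [
    Jet([(120.0, 0.1, 0.2, 125.0), (80.0, -0.3, 0.5, 84.0)], label=1),
    Jet([(200.0, 1.1, -2.0, 330.0)], label=0),
    Jet([(50.0, 0.0, 0.0, 50.0), (40.0, 0.2, 0.1, 41.0), (10.0, 0.4, 0.3, 11.0)], label=1),
]
summary = summarize(jets)
print(summary)
```

Returning a plain dictionary keeps the summary easy to log or assert on in unit tests.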

Issue

There is no issue associated with this PR.

Additional context

Data

Source: CERN Open Data Portal - Record 80030
Task: Binary classification (top quark jet tagging)
Size: ~93M events (~280GB compressed)
Features:

  • Constituent-level: 4-vectors for up to 200 particles per jet
    • pt (transverse momentum)
    • eta (pseudorapidity)
    • phi (azimuthal angle)
    • energy
  • High-level (optional): 15 jet-level features including mass, tau ratios, ECF, etc.
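One plausible way to apply a cap such as max_constituents is to truncate each jet to its first N constituents and zero-pad shorter jets, keeping a mask of the real particles. The sketch below assumes that behaviour purely for illustration; the PR may instead keep variable-length point clouds, and `pad_constituents` is a hypothetical name.

```python
def pad_constituents(constituents, max_constituents=200):
    """Truncate or zero-pad one jet to exactly max_constituents 4-vectors.

    Returns (padded, mask), where mask[i] is True for real particles
    and False for padding entries. This is an assumed scheme, not
    necessarily what the dataset class does.
    """
    kept = list(constituents[:max_constituents])
    n_real = len(kept)
    n_pad = max_constituents - n_real
    padded = kept + [(0.0, 0.0, 0.0, 0.0)] * n_pad
    mask = [True] * n_real + [False] * n_pad
    return padded, mask


# Each constituent is a (pt, eta, phi, energy) tuple.
jet = [(120.0, 0.1, 0.2, 125.0), (80.0, -0.3, 0.5, 84.0), (5.0, 0.4, 1.0, 5.4)]

padded, mask = pad_constituents(jet, max_constituents=4)  # one padding row added
truncated, trunc_mask = pad_constituents(jet, max_constituents=2)  # last row dropped
```

The mask lets downstream code ignore padding rows, for example when averaging over constituents.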

Testing

  • Added unit tests for:
    • download failure handling and logging,
    • preprocessing with .h5 fallback when no .h5.gz files exist,
    • preprocessing error handling when no files are found,
    • flexible HDF5 loading (compressed/uncompressed, label and feature extraction, slicing by num_jets),
    • graph construction, filtering, and transforms in process(),
    • stats() output (basic summary + label distribution).
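The fallback and error-handling cases listed above could be tested along the following lines. This is a minimal sketch using only the standard library: `find_raw_files` is a hypothetical stand-in for the file-discovery step of the preprocessing pipeline, and the real tests in this PR may be structured differently (for example with pytest fixtures and mocks).

```python
import tempfile
import unittest
from pathlib import Path


def find_raw_files(raw_dir):
    """Hypothetical helper: prefer *.h5.gz, fall back to *.h5, else raise."""
    raw_dir = Path(raw_dir)
    compressed = sorted(raw_dir.glob("*.h5.gz"))
    if compressed:
        return compressed
    uncompressed = sorted(raw_dir.glob("*.h5"))
    if uncompressed:
        return uncompressed
    raise FileNotFoundError(f"No .h5.gz or .h5 files found in {raw_dir}")


class TestFindRawFiles(unittest.TestCase):
    def test_prefers_compressed(self):
        with tempfile.TemporaryDirectory() as tmp:
            (Path(tmp) / "train.h5.gz").touch()
            (Path(tmp) / "train.h5").touch()
            names = [f.name for f in find_raw_files(tmp)]
            self.assertEqual(names, ["train.h5.gz"])

    def test_falls_back_to_h5(self):
        with tempfile.TemporaryDirectory() as tmp:
            (Path(tmp) / "train.h5").touch()
            names = [f.name for f in find_raw_files(tmp)]
            self.assertEqual(names, ["train.h5"])

    def test_raises_when_no_files(self):
        with tempfile.TemporaryDirectory() as tmp:
            with self.assertRaises(FileNotFoundError):
                find_raw_files(tmp)


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFindRawFiles)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Empty placeholder files are enough here because only file discovery is exercised, not HDF5 parsing.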

amiiiza avatar Nov 25 '25 07:11 amiiiza

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 94.24%. Comparing base (5cc4932) to head (f22d9ed).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #246      +/-   ##
==========================================
+ Coverage   94.01%   94.24%   +0.23%     
==========================================
  Files         184      186       +2     
  Lines        6664     6936     +272     
==========================================
+ Hits         6265     6537     +272     
  Misses        399      399              


codecov[bot] avatar Nov 25 '25 07:11 codecov[bot]

Hi @amiiiza! Two quick comments:

  • Don't worry about the project coverage check; its failure is not related to your PR (I am trying to fix it now). Sorry for the inconvenience!

  • Did you fill out the required Google Form with the information about your PR? We can't find an entry assigned to your PR.

Thank you!

gbg141 avatar Nov 26 '25 02:11 gbg141

Hi @gbg141, thank you very much for letting me know and for resolving the issue. I have also submitted the form; it should be accessible now.

amiiiza avatar Nov 26 '25 06:11 amiiiza