Category: A1; Team name: Amiiiza; Dataset: ATLAS Top Tagging
Checklist
- [x] My pull request has a clear and explanatory title.
- [x] My pull request passes the Linting test.
- [x] I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
- [x] My PR follows PEP8 guidelines. (refer to comment below)
- [x] My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
- [x] I linked to issues and PRs that are relevant to this PR.
Description
This PR adds the ATLAS Top Tagging dataset from CERN Open Data to TopoBench’s pointcloud domain.
It introduces a full dataset implementation, loader, and configuration needed to use this dataset for binary top-quark jet tagging.
Main changes:
- Added `ATLASTopTaggingDataset` with:
  - Support for constituent-level 4-vectors (pt, eta, phi, energy) for up to 200 particles per jet.
  - Optional high-level jet features (15 variables: mass, τ ratios, ECFs, etc.).
  - Configurable options (see the usage sketch after this list) for:
    - `split` (train/val/test),
    - `subset` fraction for fast experimentation,
    - `max_constituents`,
    - toggling high-level features.
- Implemented `ATLASTopTaggingDatasetLoader` in the pointcloud data domain.
- Implemented a preprocessing pipeline that:
  - Downloads the raw files from the CERN Open Data portal.
  - Handles both compressed `.h5.gz` and uncompressed `.h5` input, with a fallback when `.gz` files are missing.
  - Saves preprocessed data to a reusable `.pt` file.
- Added a `stats()` helper to summarize:
  - number of jets,
  - class distribution (signal/background),
  - average number of constituents per jet.
- Added Hydra/OmegaConf configuration files to register the dataset and loader within the existing TopoBench experiment setup.
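For orientation, here is a minimal usage sketch assembled from the option names listed above. The import path and the `use_high_level` flag name are assumptions for illustration, not necessarily the exact API introduced by this PR:

```python
# Hedged usage sketch (not verbatim from this PR): the import path and the
# `use_high_level` flag name are assumptions; `split`, `subset`, and
# `max_constituents` are the option names listed above.
from topobench.data.datasets import ATLASTopTaggingDataset  # assumed module path

dataset = ATLASTopTaggingDataset(
    root="data/atlas_top_tagging",   # where raw files and the preprocessed .pt are cached
    split="train",                   # one of "train" / "val" / "test"
    subset=0.01,                     # fraction of jets, for fast experimentation
    max_constituents=200,            # pad/truncate constituent lists to this length
    use_high_level=True,             # assumed name of the high-level-features toggle
)

# stats() is described above as summarizing jet counts, class balance,
# and the average number of constituents per jet.
print(dataset.stats())
```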
Issue
There is no issue associated with this PR.
Additional context
Data
Source: CERN Open Data Portal - Record 80030
Task: Binary classification (top quark jet tagging)
Size: ~93M events (~280GB compressed)
Features:
- Constituent-level: 4-vectors for up to 200 particles per jet (a hedged reading sketch follows below)
  - `pt` (transverse momentum)
  - `eta` (pseudorapidity)
  - `phi` (azimuthal angle)
  - `energy`
- High-level (optional): 15 jet-level features including mass, tau ratios, ECFs, etc.
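To make the raw format concrete, here is a hedged sketch of reading the constituent-level 4-vectors, including the `.h5.gz` fallback mentioned in the preprocessing pipeline above. The HDF5 key names (`fjet_clus_pt`, ...) are hypothetical placeholders; the actual keys are defined by Record 80030.

```python
import gzip
import shutil
from pathlib import Path

import h5py
import numpy as np


def read_constituents(path: Path, num_jets: int | None = None) -> np.ndarray:
    """Return an array of shape (jets, constituents, 4) holding (pt, eta, phi, energy).

    The key names below ("fjet_clus_pt", ...) are hypothetical placeholders;
    the real dataset keys come from CERN Open Data record 80030.
    """
    # Fallback mirroring the PR description: read .h5 directly, and
    # decompress .h5.gz on the fly if that is all we have.
    if path.suffix == ".gz":
        decompressed = path.with_suffix("")  # strip ".gz" -> "*.h5"
        with gzip.open(path, "rb") as src, open(decompressed, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path = decompressed

    with h5py.File(path, "r") as f:
        sl = slice(None) if num_jets is None else slice(num_jets)
        cols = [f[key][sl] for key in ("fjet_clus_pt", "fjet_clus_eta",
                                       "fjet_clus_phi", "fjet_clus_E")]
    return np.stack(cols, axis=-1)
```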
Testing
- Added unit tests for:
  - download failure handling and logging,
  - preprocessing with `.h5` fallback when no `.h5.gz` files exist (a sketch of this case follows the list),
  - preprocessing error handling when no files are found,
  - flexible HDF5 loading (compressed/uncompressed, label and feature extraction, slicing by `num_jets`),
  - graph construction, filtering, and transforms in `process()`,
  - `stats()` output (basic summary + label distribution).
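As a rough illustration of the `.h5` fallback case (not the PR's actual test code), a pytest sketch built around the `read_constituents` helper sketched earlier could look like this:

```python
# Hedged pytest sketch of the ".h5 fallback" case; it exercises the
# read_constituents() helper sketched above, not the PR's real test suite.
import gzip

import h5py
import numpy as np
import pytest

from atlas_io_sketch import read_constituents  # hypothetical module holding the helper above


@pytest.mark.parametrize("compressed", [False, True])
def test_h5_fallback(tmp_path, compressed):
    # Build a tiny HDF5 file with 3 jets x 2 constituents per (hypothetical) key.
    h5_path = tmp_path / "jets.h5"
    with h5py.File(h5_path, "w") as f:
        for key in ("fjet_clus_pt", "fjet_clus_eta", "fjet_clus_phi", "fjet_clus_E"):
            f[key] = np.ones((3, 2), dtype=np.float32)

    path = h5_path
    if compressed:
        # Re-pack as .h5.gz and remove the plain .h5 to force the decompression path.
        gz_path = tmp_path / "jets.h5.gz"
        gz_path.write_bytes(gzip.compress(h5_path.read_bytes()))
        h5_path.unlink()
        path = gz_path

    constituents = read_constituents(path, num_jets=2)
    assert constituents.shape == (2, 2, 4)
```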
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 94.24%. Comparing base (5cc4932) to head (f22d9ed).
Additional details and impacted files
@@ Coverage Diff @@
## main #246 +/- ##
==========================================
+ Coverage 94.01% 94.24% +0.23%
==========================================
Files 184 186 +2
Lines 6664 6936 +272
==========================================
+ Hits 6265 6537 +272
Misses 399 399
Hi @amiiiza! Two quick comments:
- Don't worry about the project coverage check; its failure is not related to your PR (I am trying to fix it now). Sorry for the inconvenience!
- Did you fill out the required Google Form with the information for your PR? We couldn't find an entry assigned to your PR.
Thank you!
Hi @gbg141, thanks a lot for letting me know and for resolving the issue. I have also submitted the form; it should be accessible now.