MeanAverageRecall does not follow COCO: mAR@K should use top-K detections per image, not globally
Search before asking
- [x] I have searched the Supervision issues and found no similar bug report.
Bug
Describe the bug
The current implementation of MeanAverageRecall computes mAR@K by selecting the top-K predictions across all images in the dataset, rather than selecting the top-K predictions per image. According to the COCO evaluation protocol, mAR@K should be calculated by considering the top-K highest-confidence detections for each image.
> Average Recall (AR): `AR^{max=K}` is the AR given K detections per image
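For reference, COCO defines this metric as recall computed after keeping at most K detections per image, averaged over the categories $C$ and the IoU thresholds $T = \{0.50, 0.55, \ldots, 0.95\}$ (a rough rendering of the definition, not a formula from the supervision codebase):

$$
\mathrm{AR}^{\max=K} = \frac{1}{|C|\,|T|} \sum_{c \in C} \sum_{t \in T} \mathrm{recall}_{c,t}(K)
$$

where $\mathrm{recall}_{c,t}(K)$ is the recall for category $c$ at IoU threshold $t$ once each image's detections have been truncated to its $K$ highest-confidence predictions. The crucial point is that the truncation to $K$ happens per image.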
This issue occurs because, in the concatenation step below, all detection results are merged without tracking which image each detection came from. As a result, the subsequent top-K selection is performed globally across the entire dataset rather than per image.
https://github.com/roboflow/supervision/blob/deb1c9c4f4b0cd678416a67c8a13f2ef8ed6878f/supervision/metrics/mean_average_recall.py#L222-L225
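To make the bias concrete, here is a small illustrative sketch (the arrays are hypothetical and not supervision's internal format): with K = 100, the global top-K keeps only detections from the dense image, while the COCO-style per-image top-K also keeps the sparse image's detections.

```python
import numpy as np

# Hypothetical scores: image A is detection-dense, image B is sparse.
scores_a = np.full(200, 0.9)  # 200 detections at confidence 0.9
scores_b = np.full(5, 0.5)    # 5 detections at confidence 0.5
K = 100

# Current behaviour: concatenate first, then take the global top-K.
global_top_k = np.sort(np.concatenate([scores_a, scores_b]))[::-1][:K]
print((global_top_k == 0.9).sum())  # 100 -> image B contributes nothing

# COCO behaviour: take the top-K per image, then concatenate.
per_image_top_k = np.concatenate([
    np.sort(scores_a)[::-1][:K],  # 100 detections kept from image A
    np.sort(scores_b)[::-1][:K],  # all 5 detections kept from image B
])
print(per_image_top_k.size)  # 105
```

Under the global selection, image B's detections never enter the recall computation at all, which is exactly the per-image bias the COCO protocol avoids.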
Proposed Solution
To address this issue, I have modified the `_compute` and `_compute_average_recall_for_classes` functions so that only the top-K detections per image are considered when calculating mAR@K, in accordance with the COCO evaluation protocol.
In both functions, rather than concatenating all detections immediately, the per-image statistics are first filtered down to the K highest-confidence detections; only then are they concatenated and used to compute the confusion matrix. I will submit a pull request with these changes shortly.
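The core of the change looks roughly like the sketch below (the helper name and data layout are illustrative, not the exact code in the PR):

```python
import numpy as np

def keep_top_k_per_image(stats_per_image, k):
    """Illustrative helper: truncate each image's statistics to its k
    highest-confidence detections before any cross-image concatenation.

    stats_per_image: list of (confidences, matches) array pairs, one per image.
    """
    filtered = []
    for confidences, matches in stats_per_image:
        # Indices of this image's k highest-confidence detections.
        order = np.argsort(-confidences)[:k]
        filtered.append((confidences[order], matches[order]))
    return filtered

# Only after this per-image truncation are the statistics concatenated,
# so the downstream confusion-matrix computation sees at most k
# detections per image, matching AR^{max=K}.
```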
Environment
- Supervision 0.26.1
- OS: Ubuntu 24.04
- Python: 3.12.3
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?
- [x] Yes I'd like to help by submitting a PR!
Excellent technical analysis, @stop1one! This is a critical bug that affects the validity of mAR@K evaluations for object detection models.
You're absolutely correct that the current implementation violates the COCO protocol. The key issue is indeed the concatenation step, where `np.concatenate(*stats, 0)` merges detections globally rather than maintaining per-image boundaries for top-K selection.
Technical Impact:
- Bias toward high-density images: Images with many detections dominate the global top-K selection
- Inconsistent with COCO benchmarks: Results won't match official COCO evaluation tools
- Affects model comparison: Different models may show artificially different mAR@K scores
Validation Approach: To verify the fix in PR #1967, I'd suggest (a cross-check sketch follows this list):
- Comparing results with the official COCO evaluation tool (pycocotools)
- Testing with datasets whose per-image detection densities vary widely
- Checking that mAR@100 ≥ mAR@10 (a mathematical property that should always hold)
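For the first point, a minimal cross-check sketch with pycocotools could look like this (the file paths are placeholders; after `summarize()`, `stats[6]`, `stats[7]`, and `stats[8]` hold AR@1, AR@10, and AR@100 for bbox evaluation):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO-format ground truth and detections in the
# standard COCO results format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

# Reference AR values to compare against supervision's MeanAverageRecall
# on the same ground truth and detections.
ar_1, ar_10, ar_100 = coco_eval.stats[6:9]

# Monotonicity check from the last bullet: allowing more detections per
# image can only preserve or increase recall.
assert ar_1 <= ar_10 <= ar_100
```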
This is exactly the kind of subtle but important bug that can slip through: the global top-K selection might even give "reasonable-looking" results while being fundamentally incorrect.
Thanks for the detailed analysis and the fix! Looking forward to reviewing PR #1967.
Best regards,
Gabriel
I would like to solve this issue for Hacktoberfest.
@aviralgarg05 Thanks for your interest, but I've already opened PR #1967, which resolves this issue; it's currently under review.