MeanAverageRecall does not follow COCO: mAR@K should use top-K detections per image, not globally
Search before asking
- [x] I have searched the Supervision issues and found no similar bug report.
Bug
Describe the bug
The current implementation of MeanAverageRecall computes mAR@K by selecting the top-K predictions across all images in the dataset, rather than selecting the top-K predictions per image. According to the COCO evaluation protocol, mAR@K should be calculated by considering the top-K highest-confidence detections for each image.
> Average Recall (AR): `AR^{max=K}` is the AR given K detections per image
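For reference, COCO defines this metric as recall computed after keeping at most K detections per image, averaged over the categories $C$ and the IoU thresholds $T = \{0.50, 0.55, \ldots, 0.95\}$ (a rough rendering of the definition, not a formula from the supervision codebase):

$$
\mathrm{AR}^{\max=K} = \frac{1}{|C|\,|T|} \sum_{c \in C} \sum_{t \in T} \mathrm{recall}_{c,t}(K)
$$

where $\mathrm{recall}_{c,t}(K)$ is the recall for category $c$ at IoU threshold $t$ once each image's detections have been truncated to its $K$ highest-confidence predictions. The crucial point is that the truncation to $K$ happens per image.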
This issue occurs because, in the concatenation step below, all detection results are merged without tracking which image each detection came from. As a result, the subsequent top-K selection is performed globally across the entire dataset rather than per image.
https://github.com/roboflow/supervision/blob/deb1c9c4f4b0cd678416a67c8a13f2ef8ed6878f/supervision/metrics/mean_average_recall.py#L222-L225
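To make the bias concrete, here is a small illustrative sketch (the arrays are hypothetical and not supervision's internal format): with K = 100, the global top-K keeps only detections from the dense image, while the COCO-style per-image top-K also keeps the sparse image's detections.

```python
import numpy as np

# Hypothetical scores: image A is detection-dense, image B is sparse.
scores_a = np.full(200, 0.9)  # 200 detections at confidence 0.9
scores_b = np.full(5, 0.5)    # 5 detections at confidence 0.5
K = 100

# Current behaviour: concatenate first, then take the global top-K.
global_top_k = np.sort(np.concatenate([scores_a, scores_b]))[::-1][:K]
print((global_top_k == 0.9).sum())  # 100 -> image B contributes nothing

# COCO behaviour: take the top-K per image, then concatenate.
per_image_top_k = np.concatenate([
    np.sort(scores_a)[::-1][:K],  # 100 detections kept from image A
    np.sort(scores_b)[::-1][:K],  # all 5 detections kept from image B
])
print(per_image_top_k.size)  # 105
```

Under the global selection, image B's detections never enter the recall computation at all, which is exactly the per-image bias the COCO protocol avoids.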
Proposed Solution
To address this issue, I have modified the `_compute` and `_compute_average_recall_for_classes` functions so that only the top-K detections per image are considered when calculating mAR@K, in accordance with the COCO evaluation protocol.
In both functions, rather than concatenating all detections immediately, the per-image statistics are first filtered down to the K highest-confidence detections; only then are they concatenated and used to compute the confusion matrix. I will submit a pull request with these changes shortly.
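The core of the change looks roughly like the sketch below (the helper name and data layout are illustrative, not the exact code in the PR):

```python
import numpy as np

def keep_top_k_per_image(stats_per_image, k):
    """Illustrative helper: truncate each image's statistics to its k
    highest-confidence detections before any cross-image concatenation.

    stats_per_image: list of (confidences, matches) array pairs, one per image.
    """
    filtered = []
    for confidences, matches in stats_per_image:
        # Indices of this image's k highest-confidence detections.
        order = np.argsort(-confidences)[:k]
        filtered.append((confidences[order], matches[order]))
    return filtered

# Only after this per-image truncation are the statistics concatenated,
# so the downstream confusion-matrix computation sees at most k
# detections per image, matching AR^{max=K}.
```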
Environment
- Supervision 0.26.1
- OS: Ubuntu 24.04
- Python: 3.12.3
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?
- [x] Yes I'd like to help by submitting a PR!
Excellent technical analysis, @stop1one! This is a critical bug that affects the validity of mAR@K evaluations for object detection models.
You're absolutely correct that the current implementation violates the COCO protocol. The key issue is indeed the concatenation step, where `np.concatenate(*stats, 0)` merges detections globally rather than maintaining per-image boundaries for top-K selection.
Technical Impact:
- Bias toward high-density images: Images with many detections dominate the global top-K selection
- Inconsistent with COCO benchmarks: Results won't match official COCO evaluation tools
- Affects model comparison: Different models may show artificially different mAR@K scores
Validation Approach: To verify the fix in PR #1967, I'd suggest (a cross-check sketch follows this list):
- Comparing results with the official COCO evaluation tool (pycocotools)
- Testing with datasets whose per-image detection densities vary widely
- Checking that mAR@100 ≥ mAR@10 (a mathematical property that should always hold)
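For the first point, a minimal cross-check sketch with pycocotools could look like this (the file paths are placeholders; after `summarize()`, `stats[6]`, `stats[7]`, and `stats[8]` hold AR@1, AR@10, and AR@100 for bbox evaluation):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO-format ground truth and detections in the
# standard COCO results format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

# Reference AR values to compare against supervision's MeanAverageRecall
# on the same ground truth and detections.
ar_1, ar_10, ar_100 = coco_eval.stats[6:9]

# Monotonicity check from the last bullet: allowing more detections per
# image can only preserve or increase recall.
assert ar_1 <= ar_10 <= ar_100
```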
This is exactly the kind of subtle but important bug that can slip through: the global top-K selection might even give "reasonable-looking" results while being fundamentally incorrect.
Thanks for the detailed analysis and the fix! Looking forward to reviewing PR #1967.
Best regards,
Gabriel
I would like to solve this issue for Hacktoberfest.
@aviralgarg05 Thanks for your interest, but I've already opened PR #1967, which resolves this issue; it's currently under review.