box_area and box_iou functions for cxcywh format
🚀 The feature
Native box_area and box_iou functions for cxcywh format.
Motivation, pitch
Since the cxcywh format is common, we can use faster and simpler functions to calculate box area and box IoU directly in this format.
Currently, we first need to convert to the xyxy format before using box_area and box_iou in torchvision.
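For illustration, a rough sketch of the two paths for box area with the current API (just a sketch, not the proposed implementation):

```python
import torch
from torchvision.ops import box_area, box_convert

# Boxes in (cx, cy, w, h): centers in [0, 100), side lengths in [1, 10).
centers = torch.rand(100, 2) * 100
sizes = torch.rand(100, 2) * 9 + 1
boxes_cxcywh = torch.cat([centers, sizes], dim=1)

# Current two-step path: convert to xyxy first, then compute the area.
boxes_xyxy = box_convert(boxes_cxcywh, in_fmt="cxcywh", out_fmt="xyxy")
area_two_step = box_area(boxes_xyxy)

# Proposed direct path: in cxcywh the area is simply w * h,
# with no intermediate tensor and no format round trip.
area_direct = boxes_cxcywh[:, 2] * boxes_cxcywh[:, 3]
```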
Alternatives
No response
Additional context
I can open a pull request from alperenunlu/vision@cf93d9e
Hey @alperenunlu, thanks for your neat PR with the attached test case!
This definitely provides a more straightforward way to compute area and IoU for cxcywhr format. At the same time, this also adds some code, which is slightly redundant with the functions for xyxy format.
I'd like to understand the benefits of using these new functions over a two-step process: first convert the bounding box format and then compute IoU and areas. Could you please help me understand the trade-offs between these approaches? What are the advantages of having separate functions for cxcywhr format, and how do they outweigh the added complexity? I am trying to understand if the box conversion is actually the bottleneck in data pipelines involving these operations, and what could be the gain in having dedicated optimized functions to address it.
Thanks in advance for your input! Best regards, Antoine
Hi @AntoineSimoulin,
Thanks a lot for the thoughtful feedback!
Just to clarify up front — this PR targets the standard cxcywh format (center-x, center-y, width, height), not cxcywhr with rotation. There’s no orientation handling here (though box_area_center would technically still work with cxcywhr by ignoring the angle component).
The motivation behind these functions comes from workflows where bounding boxes are already in cxcywh format, including (but not limited to) YOLO-style models and mAP computation when the boxes are already in that format. The format is common in many pipelines, both in models and in preprocessing steps.
Here are a few key advantages of providing native support:
- Performance: Avoiding the conversion to `xyxy` saves time in tight loops, particularly in training, where IoU and area computations are applied to many boxes across all cells and predictions, and during mAP evaluation.
- Precision and type handling: When the input data is in integer format (as is common in datasets or quantized models), converting to `xyxy` typically requires casting to float, which adds computational overhead. If the data needs to remain in integer form, casting back from float can lose precision due to rounding. Native operations in `cxcywh` avoid these extra conversions and keep tighter control over dtypes and numerical consistency. (This is why the test cases needed `INT_BOXES_CXCYWH`: converting to `xyxy` casts to float, and recasting to integer drops the fractional part; see the sketch after this list.)
- Cleaner, safer code: In pipelines that operate directly in `cxcywh`, having native functions avoids back-and-forth conversions and reduces the risk of subtle bugs or inconsistencies.
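To make the dtype point concrete, here is a small illustrative example (not the actual test code from the PR) of what happens to integer `cxcywh` boxes on the two-step path:

```python
import torch
from torchvision.ops import box_convert

# An integer box in (cx, cy, w, h) with an odd side length has
# half-pixel corners in xyxy.
boxes_cxcywh = torch.tensor([[10, 10, 5, 5]], dtype=torch.int64)

# Two-step path: the conversion promotes to floating point,
# giving corners (7.5, 7.5, 12.5, 12.5).
boxes_xyxy = box_convert(boxes_cxcywh, in_fmt="cxcywh", out_fmt="xyxy")

# A pipeline that needs integers back truncates the half pixels; recovering
# the center from the truncated corners then shifts it from 10 to 9.
corners_int = boxes_xyxy.to(torch.int64)                     # [[7, 7, 12, 12]]
cx_recovered = (corners_int[:, 0] + corners_int[:, 2]) // 2  # tensor([9])

# Direct path: the area never leaves integer arithmetic.
area_direct = boxes_cxcywh[:, 2] * boxes_cxcywh[:, 3]        # tensor([25])
```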
I understand the concern around added code — I’ve tried to keep the implementation minimal, well-contained and tested, and I’m definitely open to suggestions.
Thanks again — happy to discuss further!
Best,
Alperen
Hey @alperenunlu, yeah, sorry for the confusion, I meant the cxcywh format (center-x, center-y, width, height). I think all of this makes sense. Would it be possible for you to produce a small benchmark to illustrate the gains in terms of performance, precision, and type handling? It would be extremely useful to justify the decision to add the code. Let me know what is possible for you. Thanks a lot for your time and efforts!
Hey @AntoineSimoulin,
I've extensively profiled the code and included both the implementation and output below. Here's a summary of the results:
- `box_area_center` is approximately 10x faster
- `box_iou_center` is about 1.25x faster
Under more realistic conditions (fewer boxes), the improvements are even more meaningful:
- `box_area_center` is 6x faster
- `box_iou_center` is 2x faster
These benchmarks were run on a T4 GPU, and the results are consistent with my tests on an M1 MacBook (both CPU and MPS backends).
I also ran a separate benchmark using perf_counter_ns, which showed:
- 6x speedup for the area function
- 1.7x speedup for the IoU function
To ensure consistency, I ran 10 iterations across box counts ranging from 1 to 1001 (in steps of 5). The GPU performance gains remain consistent throughout.
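For reference, a sweep of that shape can be sketched roughly like this (illustrative, not the exact script; `area_two_step`/`area_one_step` stand in for the benchmarked functions):

```python
import torch
from time import perf_counter_ns
from torchvision.ops import box_area, box_convert

def area_two_step(boxes):
    # Existing path: convert to xyxy, then compute the area.
    return box_area(box_convert(boxes, in_fmt="cxcywh", out_fmt="xyxy"))

def area_one_step(boxes):
    # Direct path: area in cxcywh is w * h.
    return boxes[:, 2] * boxes[:, 3]

# Box counts from 1 to 1001 in steps of 5, 10 iterations each.
for n in range(1, 1002, 5):
    boxes = torch.rand(n, 4) * 100
    for name, fn in [("2-step", area_two_step), ("1-step", area_one_step)]:
        start = perf_counter_ns()
        for _ in range(10):
            fn(boxes)
        print(f"{name} n={n}: {(perf_counter_ns() - start) / 10:.0f} ns/call")
```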
One thing to note: while the GPU speedups are stable, on CPU, the performance gain for the IoU function diminishes once comparisons exceed 100x100 boxes. At that point, the IoU computation becomes the bottleneck, and the speedup drops to around 1x.
Feel free to test it further!
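The profiling harness itself is not reproduced here, but a simplified sketch of how such a `torch.profiler` comparison can be set up is below (the `record_function` labels mirror the tables; box counts are illustrative, and the tables below come from separate runs per variant):

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function
from torchvision.ops import box_area, box_convert

def profile_area(n_boxes: int = 1000, iters: int = 200, device: str = "cuda"):
    # Compare the two-step and direct area computations on random cxcywh boxes.
    boxes = torch.rand(n_boxes, 4, device=device) * 100
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,
    ) as prof:
        for _ in range(iters):
            with record_function("Area 2 Step"):
                box_area(box_convert(boxes, in_fmt="cxcywh", out_fmt="xyxy"))
            with record_function("Area 1 Step"):
                boxes[:, 2] * boxes[:, 3]
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

profile_area()
```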
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Area 2 Step 0.00% 0.000us 0.00% 0.000us 0.000us 2.276s 1213.53% 2.276s 11.380ms 0 b 0 b 0 b 0 b 200
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 94.302ms 50.28% 94.302ms 3.949us 0 b 0 b 0 b 0 b 23880
Area 2 Step 16.71% 716.465ms 54.50% 2.337s 11.687ms 0.000us 0.00% 93.617ms 468.083us 0 b -8 b 0 b -62.96 Mb 200
aten::mul 14.79% 634.346ms 24.56% 1.053s 50.630us 73.897ms 39.40% 73.929ms 3.554us 1.52 Mb 1.52 Mb 42.97 Mb 42.97 Mb 20800
aten::sub 11.42% 489.867ms 18.63% 799.095ms 47.565us 63.103ms 33.65% 63.129ms 3.758us 1.52 Mb 1.52 Mb 34.38 Mb 34.38 Mb 16800
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 61.857ms 32.98% 61.857ms 3.885us 0 b 0 b 0 b 0 b 15920
aten::add 4.94% 212.045ms 7.19% 308.375ms 35.043us 31.516ms 16.80% 31.516ms 3.581us 1.52 Mb 1.52 Mb 17.19 Mb 17.19 Mb 8800
aten::stack 2.36% 101.184ms 11.35% 486.610ms 110.593us 0.000us 0.00% 18.682ms 4.246us 3.04 Mb 0 b 31.38 Mb 0 b 4400
aten::cat 4.36% 186.857ms 5.79% 248.417ms 56.458us 18.665ms 9.95% 18.682ms 4.246us 3.04 Mb 3.04 Mb 31.38 Mb 31.38 Mb 4400
void at::native::(anonymous namespace)::CatArrayBatc... 0.00% 0.000us 0.00% 0.000us 0.000us 18.665ms 9.95% 18.665ms 4.666us 0 b 0 b 0 b 0 b 4000
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 11.828ms 6.31% 11.828ms 2.957us 0 b 0 b 0 b 0 b 4000
aten::to 0.18% 7.706ms 6.94% 297.843ms 212.745us 0.000us 0.00% 349.500us 0.250us 4.69 Kb 0 b 1.57 Mb 0 b 1400
aten::_to_copy 0.43% 18.509ms 6.77% 290.137ms 207.241us 0.000us 0.00% 349.500us 0.250us 4.69 Kb 0 b 1.57 Mb 0 b 1400
aten::copy_ 0.28% 12.124ms 0.42% 18.129ms 12.950us 349.500us 0.19% 349.500us 0.250us 0 b 0 b 0 b 0 b 1400
Memcpy HtoD (Pageable -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 349.500us 0.19% 349.500us 1.748us 0 b 0 b 0 b 0 b 200
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 4.289s
Self CUDA time total: 187.546ms
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Area 1 Step 0.00% 0.000us 0.00% 0.000us 0.000us 126.239ms 783.14% 126.239ms 631.194us 0 b 0 b 0 b 0 b 200
aten::mul 17.54% 54.825ms 31.54% 98.557ms 20.533us 15.768ms 97.82% 15.774ms 3.286us 1.52 Mb 1.52 Mb 8.59 Mb 8.59 Mb 4800
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 15.714ms 97.49% 15.714ms 3.948us 0 b 0 b 0 b 0 b 3980
Area 1 Step 24.81% 77.535ms 50.17% 156.796ms 783.981us 0.000us 0.00% 7.884ms 39.419us 0 b 0 b 0 b -4.30 Mb 200
aten::to 1.40% 4.388ms 7.82% 24.445ms 17.461us 0.000us 0.00% 351.893us 0.251us 4.69 Kb 0 b 1.57 Mb 0 b 1400
aten::_to_copy 2.25% 7.046ms 6.42% 20.058ms 14.327us 0.000us 0.00% 351.893us 0.251us 4.69 Kb 0 b 1.57 Mb 0 b 1400
aten::copy_ 1.22% 3.816ms 2.71% 8.473ms 6.052us 351.893us 2.18% 351.893us 0.251us 0 b 0 b 0 b 0 b 1400
Memcpy HtoD (Pageable -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 351.893us 2.18% 351.893us 1.759us 0 b 0 b 0 b 0 b 200
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 53.470us 0.33% 53.470us 2.673us 0 b 0 b 0 b 0 b 20
cudaLaunchKernel 12.22% 38.188ms 12.24% 38.261ms 9.565us 0.000us 0.00% 6.464us 0.002us 0 b 0 b 0 b 0 b 4000
Unrecognized 0.02% 73.437us 0.02% 73.437us 36.718us 6.464us 0.04% 6.464us 3.232us 0 b 0 b 0 b 0 b 2
aten::rand 1.09% 3.413ms 3.22% 10.062ms 12.578us 0.000us 0.00% 0.000us 0.000us 1.52 Mb 0 b 0 b 0 b 800
aten::empty 0.59% 1.856ms 0.59% 1.856ms 2.320us 0.000us 0.00% 0.000us 0.000us 1.52 Mb 1.52 Mb 0 b 0 b 800
aten::uniform_ 1.53% 4.792ms 1.53% 4.792ms 5.991us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 800
[memory] 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us -9.89 Mb -9.89 Mb -5.85 Mb -5.85 Mb 7817
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 312.503ms
Self CUDA time total: 16.119ms
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
IoU 2 Step 0.00% 0.000us 0.00% 0.000us 0.000us 2.479s 240.37% 2.479s 12.396ms 0 b 0 b 0 b 0 b 200
IoU 2 Step 19.78% 862.283ms 57.56% 2.509s 12.543ms 0.000us 0.00% 515.414ms 2.577ms 0 b 8 b 0 b -29.95 Gb 200
aten::sub 10.00% 435.799ms 16.57% 722.345ms 17.364us 305.982ms 29.67% 305.989ms 7.355us 3.04 Mb 3.04 Mb 15.03 Gb 15.03 Gb 41600
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 238.418ms 23.12% 238.418ms 4.608us 0 b 0 b 0 b 0 b 51740
aten::mul 11.24% 489.732ms 18.53% 807.787ms 17.715us 225.472ms 21.86% 225.497ms 4.945us 3.04 Mb 3.04 Mb 5.07 Gb 5.07 Gb 45600
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 180.455ms 17.50% 180.455ms 22.278us 0 b 0 b 0 b 0 b 8100
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 127.079ms 12.32% 127.079ms 3.991us 0 b 0 b 0 b 0 b 31840
aten::add 4.70% 204.810ms 7.63% 332.734ms 15.404us 113.323ms 10.99% 113.331ms 5.247us 3.04 Mb 3.04 Mb 5.02 Gb 5.02 Gb 21600
aten::min 0.17% 7.548ms 2.14% 93.468ms 23.367us 0.000us 0.00% 97.495ms 24.374us 0 b 0 b 9.89 Gb 0 b 4000
aten::minimum 1.22% 53.213ms 1.97% 85.920ms 21.480us 97.487ms 9.45% 97.495ms 24.374us 0 b 0 b 9.89 Gb 9.89 Gb 4000
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 97.424ms 9.45% 97.424ms 24.478us 0 b 0 b 0 b 0 b 3980
aten::max 0.20% 8.687ms 3.13% 136.315ms 34.079us 0.000us 0.00% 94.151ms 23.538us 0 b 0 b 9.89 Gb 0 b 4000
aten::maximum 1.45% 63.235ms 2.93% 127.628ms 31.907us 94.138ms 9.13% 94.151ms 23.538us 0 b 0 b 9.89 Gb 9.89 Gb 4000
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 94.075ms 9.12% 94.075ms 23.637us 0 b 0 b 0 b 0 b 3980
aten::clamp 1.47% 64.229ms 2.76% 120.170ms 30.042us 85.156ms 8.26% 85.165ms 21.291us 0 b 0 b 9.93 Gb 9.93 Gb 4000
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 4.358s
Self CUDA time total: 1.031s
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
IoU 1 Step 0.00% 0.000us 0.00% 0.000us 0.000us 2.197s 269.32% 2.197s 10.985ms 0 b 0 b 0 b 0 b 200
IoU 1 Step 21.92% 843.976ms 58.19% 2.241s 11.203ms 0.000us 0.00% 407.653ms 2.038ms 0 b 8 b 0 b -29.90 Gb 200
aten::sub 6.39% 245.836ms 10.37% 399.351ms 22.690us 212.517ms 26.05% 212.539ms 12.076us 3.04 Mb 3.04 Mb 14.99 Gb 14.99 Gb 17600
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 180.064ms 22.07% 180.064ms 22.230us 0 b 0 b 0 b 0 b 8100
aten::div 8.97% 345.251ms 15.25% 587.278ms 28.235us 136.253ms 16.70% 136.264ms 6.551us 1.52 Mb 1.52 Mb 5.07 Gb 5.07 Gb 20800
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 115.930ms 14.21% 115.930ms 5.826us 0 b 0 b 0 b 0 b 19900
aten::mul 6.11% 235.362ms 10.46% 402.684ms 29.609us 108.396ms 13.29% 108.438ms 7.973us 3.04 Mb 3.04 Mb 5.00 Gb 5.00 Gb 13600
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 108.236ms 13.27% 108.236ms 9.065us 0 b 0 b 0 b 0 b 11940
aten::min 0.31% 11.982ms 2.90% 111.731ms 27.933us 0.000us 0.00% 94.949ms 23.737us 0 b 0 b 9.89 Gb 0 b 4000
aten::minimum 1.62% 62.294ms 2.59% 99.749ms 24.937us 94.938ms 11.64% 94.949ms 23.737us 0 b 0 b 9.89 Gb 9.89 Gb 4000
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 94.878ms 11.63% 94.878ms 23.839us 0 b 0 b 0 b 0 b 3980
aten::max 0.28% 10.898ms 2.99% 114.965ms 28.741us 0.000us 0.00% 94.356ms 23.589us 0 b 0 b 9.89 Gb 0 b 4000
aten::maximum 1.81% 69.625ms 2.70% 104.067ms 26.017us 94.356ms 11.57% 94.356ms 23.589us 0 b 0 b 9.89 Gb 9.89 Gb 4000
void at::native::elementwise_kernel<128, 2, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 94.295ms 11.56% 94.295ms 23.692us 0 b 0 b 0 b 0 b 3980
aten::clamp 2.05% 79.075ms 3.12% 120.056ms 30.014us 85.122ms 10.43% 85.122ms 21.281us 0 b 0 b 9.93 Gb 9.93 Gb 4000
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 3.850s
Self CUDA time total: 815.776ms
@alperenunlu Amazing! The benchmark looks good. Can you submit a PR with the changes from cf93d9e?
I was thinking that instead of publicly exposing `box_iou_center`, we could rename it `box_iou_cxcywh`. We could add a parameter `in_fmt: str = "xyxy"` to `box_iou` and, depending on the bounding box input format, either use the original `box_iou` implementation or dispatch to `box_iou_cxcywh` (similar to what's done for `box_convert`). Let me know what you think. Thanks for your time and contribution!
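Roughly, I am imagining something along these lines (just a sketch of the dispatch idea, not a final API; `box_iou_cxcywh` here is a stand-in for the dedicated implementation):

```python
import torch
from torchvision.ops import box_convert, box_iou as _box_iou_xyxy

def box_iou_cxcywh(boxes1: torch.Tensor, boxes2: torch.Tensor) -> torch.Tensor:
    # Stand-in for the dedicated cxcywh implementation from the PR;
    # here it just converts, whereas the real function computes IoU directly.
    return _box_iou_xyxy(
        box_convert(boxes1, in_fmt="cxcywh", out_fmt="xyxy"),
        box_convert(boxes2, in_fmt="cxcywh", out_fmt="xyxy"),
    )

def box_iou(boxes1: torch.Tensor, boxes2: torch.Tensor, in_fmt: str = "xyxy") -> torch.Tensor:
    # Dispatch on the declared input format, similar in spirit to box_convert.
    if in_fmt == "xyxy":
        return _box_iou_xyxy(boxes1, boxes2)
    if in_fmt == "cxcywh":
        return box_iou_cxcywh(boxes1, boxes2)
    raise ValueError(f"Unsupported in_fmt: {in_fmt!r}")
```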
@AntoineSimoulin Thanks! Let’s add it this way for now. Since there are other IoU functions (generalized, distance, and complete), I can work on them afterward. Once we have them, we can gradually shift to a dispatched style in a future version update.
What do you think?
This is the PR: #8992. I can update the branch, and then we can merge it.
@NicolasHug Could you also take a look?
I implemented the dispatch style and updated the tests accordingly. All tests pass. PR: https://github.com/pytorch/vision/pull/8992