
Add RF-DETR Object Detection Model

whittenator opened this issue 2 months ago • 1 comment

Is your feature request related to a problem? Please describe.

Currently, OpenVINO Training Extensions offers excellent support for CNN-based object detectors like YOLO variants, but there's a gap in state-of-the-art transformer-based architectures that achieve both high accuracy and real-time performance. While RT-DETR is supported, the ecosystem would benefit from newer transformer architectures that push the boundaries of accuracy-speed trade-offs, especially for applications requiring strong domain adaptability and end-to-end deployment without complex post-processing like Non-Maximum Suppression (NMS).

Describe the solution you'd like to propose.

I propose integrating RF-DETR (Roboflow Detection Transformer) as a native object detection model in OTX. RF-DETR is the first real-time transformer-based detector to surpass 60 mAP on the COCO benchmark while maintaining competitive inference speeds. Key benefits include:

  • State-of-the-art performance: 60.5 mAP (Base) and 64.2 mAP (Large) on COCO, outperforming YOLOv11 and RT-DETR
  • Real-time inference: 25 FPS (Base) on NVIDIA T4, with new Nano/Small/Medium variants scaling to 100+ FPS
  • No NMS required: True end-to-end detection simplifies deployment and reduces latency
  • Apache 2.0 license: Fully compatible with OTX's open-source licensing model
  • Domain adaptability: Superior performance on RF100-VL benchmark across 100+ diverse real-world domains (aerial, industrial, medical, etc.)
  • Multiple model sizes: Nano (3.2M params) → Large (129M params) for flexible deployment from edge to cloud
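The "No NMS required" point deserves a concrete illustration. The sketch below is not RF-DETR's actual code; it is a minimal, self-contained Python comparison (all function names are illustrative) showing why DETR-family detectors can skip suppression: one-to-one matching during training means each object is covered by a single query, so post-processing reduces to a score threshold, whereas CNN detectors need a greedy IoU-suppression loop.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Classic NMS: greedy IoU suppression, required by most CNN detectors
    because multiple anchors/cells fire on the same object."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

def detr_postprocess(scores, score_thr=0.5):
    """End-to-end (NMS-free) post-processing: with one-to-one matched
    queries, simply keeping confident predictions suffices."""
    return [i for i, s in enumerate(scores) if s >= score_thr]
```

For example, with two heavily overlapping boxes `(0, 0, 10, 10)` and `(1, 1, 11, 11)` plus a distant third box, `nms` must compute pairwise IoUs and discard the duplicate, while `detr_postprocess` never touches box geometry at all; dropping this loop is what simplifies deployment and removes a latency- and accuracy-sensitive tuning knob (the IoU threshold).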

The integration would leverage RF-DETR's existing Python package and follow OTX's pattern of supporting transformer-based architectures, similar to how RT-DETR was integrated.
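To make "follow OTX's pattern" concrete, a new model could ship with a recipe mirroring the existing RT-DETR ones. This is a hypothetical sketch only: the class path, recipe fields, and model names below are placeholders I am assuming for illustration, not identifiers that exist in OTX or the rf-detr package today.

```yaml
# Hypothetical OTX recipe sketch -- every name below is illustrative.
model:
  class_path: otx.algo.detection.rf_detr.RFDETR   # assumed wrapper class, not yet implemented
  init_args:
    label_info: 80            # number of classes (COCO-style)
    model_name: rf_detr_base  # could scale across nano/small/medium/large variants
```

Training would then go through the same unified CLI users already know from RT-DETR, along the lines of `otx train --config <recipe> --data_root <dataset>`.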

Describe alternatives you've considered.

  1. RT-DETR: Already supported in OTX, but RF-DETR offers higher accuracy (60+ vs ~54 mAP) and better domain generalization
  2. DEIM/DFINE: These are fast, but RF-DETR offers faster inference and higher accuracy
  3. Custom implementation: Re-implementing from scratch would be redundant given RF-DETR's mature, open-source package and active maintenance by Roboflow

Additional context

  • Paper: "RF-DETR: Neural Architecture Search for Real-Time Detection Transformers"
  • Repository: https://github.com/roboflow/rf-detr
  • PyPI Package: rf-detr (Apache 2.0 license)
  • Architecture: Built on DINOv2 backbone + LW-DETR with Deformable Attention, offering excellent transfer learning capabilities
  • Industry adoption: RF100-VL benchmark is used by Apple, Microsoft, Baidu for evaluating real-world detector performance
  • OTX alignment: Fits perfectly with OTX's roadmap of integrating Transformers library and third-party backends while maintaining unified CLI/API

This would position OTX as the premier framework for both CNN and transformer-based real-time detection, giving users more options for accuracy-speed trade-offs without leaving the OTX ecosystem.

whittenator · Nov 13 '25

Thanks for the proposal, @whittenator. Good news, we are already planning to extend our suite of models with RF-DETR and other architectures. 😸

leoll2 · Nov 14 '25