Add RF-DETR Object Detection Model
Is your feature request related to a problem? Please describe.
Currently, OpenVINO Training Extensions offers excellent support for CNN-based object detectors like YOLO variants, but there's a gap in state-of-the-art transformer-based architectures that achieve both high accuracy and real-time performance. While RT-DETR is supported, the ecosystem would benefit from newer transformer architectures that push the boundaries of accuracy-speed trade-offs, especially for applications requiring strong domain adaptability and end-to-end deployment without complex post-processing like Non-Maximum Suppression (NMS).
Describe the solution you'd like to propose.
I propose integrating RF-DETR (Roboflow Detection Transformer) as a native object detection model in OTX. RF-DETR is the first real-time transformer-based detector to surpass 60 mAP on the COCO benchmark while maintaining competitive inference speeds. Key benefits include:
- State-of-the-art performance: 60.5 mAP (Base) and 64.2 mAP (Large) on COCO, outperforming YOLOv11 and RT-DETR
- Real-time inference: 25 FPS (Base) on NVIDIA T4, with new Nano/Small/Medium variants scaling to 100+ FPS
- No NMS required: True end-to-end detection simplifies deployment and reduces latency
- Apache 2.0 license: Fully compatible with OTX's open-source licensing
- Domain adaptability: Superior performance on RF100-VL benchmark across 100+ diverse real-world domains (aerial, industrial, medical, etc.)
- Multiple model sizes: Nano (3.2M params) → Large (129M params) for flexible deployment from edge to cloud
The integration would leverage RF-DETR's existing Python package and follow OTX's pattern of supporting transformer-based architectures, similar to how RT-DETR was integrated.
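To make the "No NMS required" point concrete, here is a minimal sketch of DETR-style post-processing in plain Python. It is an illustration under assumptions, not RF-DETR's or OTX's actual code: the function names (`postprocess`, `cxcywh_to_xyxy`), the per-query argmax, and the input layout (one list of class logits and one normalized `(cx, cy, w, h)` box per query) are hypothetical simplifications. The idea it shows is that each query yields at most one detection, so a score threshold is enough and no overlap-based suppression step is needed.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[int, float, Box]  # (class index, score, xyxy box in pixels)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cxcywh_to_xyxy(box: Box, width: int, height: int) -> Box:
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )


def postprocess(
    logits: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    image_size: Tuple[int, int],
    threshold: float = 0.5,
) -> List[Detection]:
    """Turn per-query outputs into detections with a score threshold only.

    Hypothetical input layout: logits[i] holds the class logits of query i,
    boxes[i] holds its normalized (cx, cy, w, h) box. No NMS is applied.
    """
    width, height = image_size
    detections: List[Detection] = []
    for query_logits, box in zip(logits, boxes):
        scores = [sigmoid(v) for v in query_logits]
        # Best class for this query.
        label = max(range(len(scores)), key=scores.__getitem__)
        if scores[label] >= threshold:
            detections.append(
                (label, scores[label], cxcywh_to_xyxy(box, width, height))
            )
    # Highest-scoring detections first.
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections


# Toy example: three queries, two classes, a 640x480 image.
example_logits = [[2.0, -3.0], [-4.0, -5.0], [-1.0, 3.0]]
example_boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1), (0.25, 0.75, 0.5, 0.5)]
for label, score, box in postprocess(example_logits, example_boxes, (640, 480)):
    print(label, round(score, 3), [round(v, 1) for v in box])
```

In this toy run the second query is dropped because its best score is below the threshold, and the remaining two are returned sorted by score. A CNN detector with dense anchors would typically need an extra IoU-based suppression pass at this point.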
Describe alternatives you've considered.
- RT-DETR: Already supported in OTX, but RF-DETR offers higher accuracy (60+ vs ~54 mAP) and better domain generalization
- DEIM/D-FINE: Both are fast, but RF-DETR offers faster inference and higher accuracy
- Custom implementation: Re-implementing from scratch would be redundant given RF-DETR's mature, open-source package and active maintenance by Roboflow
Additional context
- Paper: "RF-DETR: Neural Architecture Search for Real-Time Detection Transformers"
- Repository: https://github.com/roboflow/rf-detr
- PyPI Package: rf-detr (Apache 2.0 license)
- Architecture: Built on DINOv2 backbone + LW-DETR with Deformable Attention, offering excellent transfer learning capabilities
- Industry adoption: RF100-VL benchmark is used by Apple, Microsoft, and Baidu for evaluating real-world detector performance
- OTX alignment: Fits well with OTX's roadmap of integrating the Transformers library and third-party backends while maintaining a unified CLI/API
This would position OTX as the premier framework for both CNN and transformer-based real-time detection, giving users more options for accuracy-speed trade-offs without leaving the OTX ecosystem.
Thanks for the proposal, @whittenator. Good news: we are already planning to extend our suite of models with RF-DETR and other architectures. 😸