training_extensions

Add DEIMV2 Object Detection Model

[Open] kprokofi opened this issue 5 months ago · 9 comments

Summary

resolves #5015

  • [x] Add DINOv3 and ViT-Tiny as backbones for detection, primarily for the DEIMv2 model
  • [x] Add the DEIMv2 model (OTXModel, Encoder, Decoder), e2e training, and export
  • [x] Experiment with pre-processing, Copy-blend, EMA, learning rate and its schedule, and model weights
  • [x] Add unit tests and perf tests
  • [x] Provide final benchmark numbers (vs other DETR variants)

How to test

otx train --config src/otx/recipe/detection/deimv2_l.yaml --data_root tests/assets/car_tree_bug

Checklist

  • [x] The PR title and description are clear and descriptive
  • [x] I have manually tested the changes
  • [x] All changes are covered by automated tests
  • [x] All related issues are linked to this PR (if applicable)
  • [x] Documentation has been updated (if applicable)

kprokofi · Nov 25 '25 13:11

Model manifests will be updated once we decide which DETR models to expose.

kprokofi · Dec 02 '25 22:12

Final benchmark (averaged across all datasets): [image]

List of the datasets: [image]

Number of runs: 5 (5 different seeds)

kprokofi · Dec 10 '25 09:12

| otx_version | task | model | training:epoch_mean | training:epoch_std | training:e2e_time_mean | training:e2e_time_std | training:gpu_mem_mean | training:gpu_mem_std | training:train/iter_time_mean | training:train/iter_time_std | training:val/f1-score_mean | training:val/f1-score_std | torch:test/f1-score_mean | torch:test/f1-score_std | torch:test/iter_time_mean | torch:test/iter_time_std | torch:test/latency_mean | torch:test/latency_std | torch:test/e2e_time_mean | torch:test/e2e_time_std | export:test/f1-score_mean | export:test/f1-score_std | export:test/latency_mean | export:test/latency_std | export:test/e2e_time_mean | export:test/e2e_time_std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2.7.0dev | DETECTION | deim_dfine_l | 71.7073 | 15.0686 | 1477.3654 | 3048.6643 | 8.2641 | 2.7069 | 0.5438 | 0.1741 | 0.6679 | 0.1308 | 0.5572 | 0.2019 | 0.122 | 0.0141 | 0.0753 | 0.0607 | 3.7902 | 3.9503 | 0.559 | 0.1992 | 0.196 | 0.1271 | 10.0849 | 8.485 |
| 2.7.0dev | DETECTION | deim_dfine_m | 82.3902 | 21.246 | 1621.5693 | 3622.4548 | 6.6598 | 2.0274 | 0.4598 | 0.1824 | 0.6702 | 0.1201 | 0.5606 | 0.1843 | 0.0989 | 0.0144 | 0.0675 | 0.0577 | 3.4068 | 3.8209 | 0.5597 | 0.1834 | 0.154 | 0.1013 | 7.8983 | 6.8622 |
| 2.7.0dev | DETECTION | deim_dfine_x | 58.4634 | 13.5538 | 1409.8564 | 2897.2094 | 11.7544 | 2.7089 | 0.6032 | 0.1312 | 0.633 | 0.1562 | 0.5144 | 0.222 | 0.124 | 0.0168 | 0.0956 | 0.0819 | 4.4298 | 3.8482 | 0.5187 | 0.2171 | 0.2634 | 0.1181 | 14.941 | 12.8748 |
| 2.7.0dev | DETECTION | deimv2_l | 46.7805 | 13.1444 | 903.9662 | 1993.7543 | 9.0995 | 3.1126 | 0.4917 | 0.0971 | 0.6861 | 0.0971 | 0.6043 | 0.1395 | 0.102 | 0.0088 | 0.0757 | 0.062 | 3.7672 | 3.8452 | 0.604 | 0.1398 | 0.2569 | 0.1382 | 13.9423 | 11.5627 |
| 2.7.0dev | DETECTION | deimv2_m | 58.9024 | 16.0262 | 1029.4484 | 2173.2948 | 7.9583 | 3.4933 | 0.4768 | 0.1521 | 0.688 | 0.1002 | 0.5868 | 0.1565 | 0.0925 | 0.0099 | 0.067 | 0.0557 | 3.3796 | 3.6302 | 0.5845 | 0.156 | 0.206 | 0.124 | 10.6745 | 8.5066 |
| 2.7.0dev | DETECTION | deimv2_s | 56.1463 | 16.1006 | 966.5823 | 2220.3573 | 6.439 | 3.1682 | 0.4466 | 0.1447 | 0.6525 | 0.1213 | 0.565 | 0.1763 | 0.0854 | 0.011 | 0.0598 | 0.0503 | 3.0799 | 3.5161 | 0.5655 | 0.1756 | 0.1761 | 0.1232 | 8.5511 | 6.4467 |
| 2.7.0dev | DETECTION | deimv2_x | 47.2195 | 14.4283 | 1270.6535 | 2880.1889 | 12.7359 | 3.6657 | 0.6237 | 0.1241 | 0.6931 | 0.0937 | 0.6038 | 0.1436 | 0.1169 | 0.0106 | 0.0899 | 0.076 | 4.2772 | 3.9346 | 0.603 | 0.1434 | 0.3214 | 0.1663 | 17.4233 | 13.9527 |
| 2.7.0dev | DETECTION | dfine_x | 49.7317 | 18.3494 | 912.1143 | 1736.7324 | 10.3559 | 0.7046 | 0.5279 | 0.0389 | 0.6613 | 0.1277 | 0.5685 | 0.1751 | 0.1393 | 0.0092 | 0.0959 | 0.0786 | 4.5827 | 4.128 | 0.5686 | 0.175 | 0.2619 | 0.1257 | 14.6784 | 12.6574 |
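As a quick cross-check of the table, the size-matched DEIM-DFINE vs DEIMv2 pairs can be compared directly. This is just a sketch, not part of the benchmark tooling; the numbers are the mean torch test F1 and mean e2e training time columns copied from the table above:

```python
# Mean torch:test/f1-score and training:e2e_time (s), copied from the table above.
f1 = {
    "deim_dfine_l": 0.5572, "deim_dfine_m": 0.5606, "deim_dfine_x": 0.5144,
    "deimv2_l": 0.6043, "deimv2_m": 0.5868, "deimv2_x": 0.6038,
}
e2e = {
    "deim_dfine_l": 1477.3654, "deim_dfine_m": 1621.5693, "deim_dfine_x": 1409.8564,
    "deimv2_l": 903.9662, "deimv2_m": 1029.4484, "deimv2_x": 1270.6535,
}
for size in ("l", "m", "x"):
    old, new = f"deim_dfine_{size}", f"deimv2_{size}"
    # At every matched size, DEIMv2 both scores higher F1 and trains faster e2e.
    print(f"{size}: F1 {f1[old]:.4f} -> {f1[new]:.4f}, "
          f"e2e {e2e[old]:.0f}s -> {e2e[new]:.0f}s")
```

At all three matched sizes the DEIMv2 variant improves mean F1 while cutting mean end-to-end training time.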

kprokofi · Dec 10 '25 10:12

[image]

kprokofi · Dec 10 '25 10:12

DeimV2-S, DeimV2-M, and DeimV2-L are the recommended DETR models to expose in Geti Tune.

kprokofi · Dec 10 '25 10:12

> Final benchmark (averaged across all datasets): [image]

Do you know why the large (L) model trains faster than the small (S)?

leoll2 · Dec 10 '25 15:12

> > Final benchmark (averaged across all datasets): [image]
>
> Do you know why the large (L) model trains faster than the small (S)?

It doesn't train faster per epoch; it converges in fewer epochs. The chart shows end-to-end training time including early stopping: on average, the L model needs fewer epochs to reach high accuracy. I'm not 100% sure why; probably a better match with the hyperparameters.
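The convergence explanation can be sanity-checked against the benchmark table: dividing mean e2e training time by mean epoch count gives a rough per-epoch cost (a sketch using the mean values copied from above; early stopping makes this approximate):

```python
# Mean epoch counts and e2e training times (s), copied from the benchmark table.
models = {
    "deimv2_s": {"epochs": 56.1463, "e2e_time": 966.5823},
    "deimv2_l": {"epochs": 46.7805, "e2e_time": 903.9662},
}
for name, m in models.items():
    m["sec_per_epoch"] = m["e2e_time"] / m["epochs"]
    print(f"{name}: {m['sec_per_epoch']:.2f} s/epoch over {m['epochs']:.1f} epochs")

# L is slower per epoch (~19.3 s vs ~17.2 s) but early-stops sooner,
# so its end-to-end time comes out lower.
assert models["deimv2_l"]["sec_per_epoch"] > models["deimv2_s"]["sec_per_epoch"]
assert models["deimv2_l"]["e2e_time"] < models["deimv2_s"]["e2e_time"]
```

So L pays more per epoch but stops roughly ten epochs earlier, which is what the e2e chart reflects.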

kprokofi · Dec 10 '25 17:12

> DeimV2-S, DeimV2-M, and DeimV2-L are the recommended DETR models to expose in Geti Tune.

I agree with this proposal: the S, M, and L models show a good trade-off between F1 and speed, while the X model is slower than L without improving accuracy. Please go ahead with creating the manifests :)
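The "X is dominated by L" claim can be read off the table's torch test F1 and latency columns (a quick sketch with values copied from above, not part of the benchmark tooling):

```python
# (mean torch:test/f1-score, mean torch:test/latency in s), from the table above.
results = {
    "deimv2_s": (0.5650, 0.0598),
    "deimv2_m": (0.5868, 0.0670),
    "deimv2_l": (0.6043, 0.0757),
    "deimv2_x": (0.6038, 0.0899),
}
f1_l, lat_l = results["deimv2_l"]
f1_x, lat_x = results["deimv2_x"]
# X is roughly 19% slower than L at inference while scoring slightly lower F1,
# so it sits behind L on the accuracy/speed frontier.
assert lat_x > lat_l and f1_x < f1_l
# S, M, L form a clean frontier: each step up trades latency for F1.
assert results["deimv2_s"] < results["deimv2_m"] < results["deimv2_l"]
```

This supports exposing only S, M, and L: every model on the recommended list buys extra F1 for its extra latency, and X does not.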

leoll2 · Dec 11 '25 09:12