training_extensions

Add DEIMV2 Object Detection Model

[Open] kprokofi opened this issue 5 months ago · 9 comments

Summary

resolves #5015

  • [x] Add DINOv3 and ViT-Tiny as backbones for detection, primarily for the DEIMv2 model
  • [x] Add the DEIMv2 model (OTXModel, Encoder, Decoder), e2e training, and export
  • [x] Experiment with pre-processing, Copy-blend, EMA, learning rate and its schedule, and model weights
  • [x] Add unit tests and perf tests
  • [x] Provide final benchmark numbers (vs other DETR variants)

How to test

otx train --config src/otx/recipe/detection/deimv2_l.yaml --data_root tests/assets/car_tree_bug

Checklist

  • [x] The PR title and description are clear and descriptive
  • [x] I have manually tested the changes
  • [x] All changes are covered by automated tests
  • [x] All related issues are linked to this PR (if applicable)
  • [x] Documentation has been updated (if applicable)

kprokofi · Nov 25 '25 13:11

Model manifests will be updated once we decide which DETR models to expose.

kprokofi · Dec 02 '25 22:12

Final benchmark (averaged across all datasets): [image]

List of the datasets: [image]

Number of runs: 5 (5 different seeds)

kprokofi · Dec 10 '25 09:12

| otx_version | task | model | training:epoch_mean | training:epoch_std | training:e2e_time_mean | training:e2e_time_std | training:gpu_mem_mean | training:gpu_mem_std | training:train/iter_time_mean | training:train/iter_time_std | training:val/f1-score_mean | training:val/f1-score_std | torch:test/f1-score_mean | torch:test/f1-score_std | torch:test/iter_time_mean | torch:test/iter_time_std | torch:test/latency_mean | torch:test/latency_std | torch:test/e2e_time_mean | torch:test/e2e_time_std | export:test/f1-score_mean | export:test/f1-score_std | export:test/latency_mean | export:test/latency_std | export:test/e2e_time_mean | export:test/e2e_time_std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2.7.0dev | DETECTION | deim_dfine_l | 71.7073 | 15.0686 | 1477.3654 | 3048.6643 | 8.2641 | 2.7069 | 0.5438 | 0.1741 | 0.6679 | 0.1308 | 0.5572 | 0.2019 | 0.122 | 0.0141 | 0.0753 | 0.0607 | 3.7902 | 3.9503 | 0.559 | 0.1992 | 0.196 | 0.1271 | 10.0849 | 8.485 |
| 2.7.0dev | DETECTION | deim_dfine_m | 82.3902 | 21.246 | 1621.5693 | 3622.4548 | 6.6598 | 2.0274 | 0.4598 | 0.1824 | 0.6702 | 0.1201 | 0.5606 | 0.1843 | 0.0989 | 0.0144 | 0.0675 | 0.0577 | 3.4068 | 3.8209 | 0.5597 | 0.1834 | 0.154 | 0.1013 | 7.8983 | 6.8622 |
| 2.7.0dev | DETECTION | deim_dfine_x | 58.4634 | 13.5538 | 1409.8564 | 2897.2094 | 11.7544 | 2.7089 | 0.6032 | 0.1312 | 0.633 | 0.1562 | 0.5144 | 0.222 | 0.124 | 0.0168 | 0.0956 | 0.0819 | 4.4298 | 3.8482 | 0.5187 | 0.2171 | 0.2634 | 0.1181 | 14.941 | 12.8748 |
| 2.7.0dev | DETECTION | deimv2_l | 46.7805 | 13.1444 | 903.9662 | 1993.7543 | 9.0995 | 3.1126 | 0.4917 | 0.0971 | 0.6861 | 0.0971 | 0.6043 | 0.1395 | 0.102 | 0.0088 | 0.0757 | 0.062 | 3.7672 | 3.8452 | 0.604 | 0.1398 | 0.2569 | 0.1382 | 13.9423 | 11.5627 |
| 2.7.0dev | DETECTION | deimv2_m | 58.9024 | 16.0262 | 1029.4484 | 2173.2948 | 7.9583 | 3.4933 | 0.4768 | 0.1521 | 0.688 | 0.1002 | 0.5868 | 0.1565 | 0.0925 | 0.0099 | 0.067 | 0.0557 | 3.3796 | 3.6302 | 0.5845 | 0.156 | 0.206 | 0.124 | 10.6745 | 8.5066 |
| 2.7.0dev | DETECTION | deimv2_s | 56.1463 | 16.1006 | 966.5823 | 2220.3573 | 6.439 | 3.1682 | 0.4466 | 0.1447 | 0.6525 | 0.1213 | 0.565 | 0.1763 | 0.0854 | 0.011 | 0.0598 | 0.0503 | 3.0799 | 3.5161 | 0.5655 | 0.1756 | 0.1761 | 0.1232 | 8.5511 | 6.4467 |
| 2.7.0dev | DETECTION | deimv2_x | 47.2195 | 14.4283 | 1270.6535 | 2880.1889 | 12.7359 | 3.6657 | 0.6237 | 0.1241 | 0.6931 | 0.0937 | 0.6038 | 0.1436 | 0.1169 | 0.0106 | 0.0899 | 0.076 | 4.2772 | 3.9346 | 0.603 | 0.1434 | 0.3214 | 0.1663 | 17.4233 | 13.9527 |
| 2.7.0dev | DETECTION | dfine_x | 49.7317 | 18.3494 | 912.1143 | 1736.7324 | 10.3559 | 0.7046 | 0.5279 | 0.0389 | 0.6613 | 0.1277 | 0.5685 | 0.1751 | 0.1393 | 0.0092 | 0.0959 | 0.0786 | 4.5827 | 4.128 | 0.5686 | 0.175 | 0.2619 | 0.1257 | 14.6784 | 12.6574 |
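As a quick cross-check of the table, the size-matched DEIM-DFINE vs DEIMv2 pairs can be compared directly. This is just a sketch, not part of the benchmark tooling; the numbers are the mean torch test F1 and mean e2e training time columns copied from the table above:

```python
# Mean torch:test/f1-score and training:e2e_time (s), copied from the table above.
f1 = {
    "deim_dfine_l": 0.5572, "deim_dfine_m": 0.5606, "deim_dfine_x": 0.5144,
    "deimv2_l": 0.6043, "deimv2_m": 0.5868, "deimv2_x": 0.6038,
}
e2e = {
    "deim_dfine_l": 1477.3654, "deim_dfine_m": 1621.5693, "deim_dfine_x": 1409.8564,
    "deimv2_l": 903.9662, "deimv2_m": 1029.4484, "deimv2_x": 1270.6535,
}
for size in ("l", "m", "x"):
    old, new = f"deim_dfine_{size}", f"deimv2_{size}"
    # At every matched size, DEIMv2 both scores higher F1 and trains faster e2e.
    print(f"{size}: F1 {f1[old]:.4f} -> {f1[new]:.4f}, "
          f"e2e {e2e[old]:.0f}s -> {e2e[new]:.0f}s")
```

At all three matched sizes the DEIMv2 variant improves mean F1 while cutting mean end-to-end training time.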

kprokofi · Dec 10 '25 10:12

[image]

kprokofi · Dec 10 '25 10:12

DeimV2-S, DeimV2-M, and DeimV2-L are the recommended DETR models to expose in Geti Tune.

kprokofi · Dec 10 '25 10:12

> Final benchmark (averaged across all datasets): [image]

Do you know why the large (L) model trains faster than the small (S)?

leoll2 · Dec 10 '25 15:12

> > Final benchmark (averaged across all datasets): [image]
>
> Do you know why the large (L) model trains faster than the small (S)?

It doesn't train faster per epoch; it converges in fewer epochs. The chart shows end-to-end training time including early stopping: on average, the L model needs fewer epochs to reach high accuracy. I'm not 100% sure why; probably a better match with the hyperparameters.
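The convergence explanation can be sanity-checked against the benchmark table: dividing mean e2e training time by mean epoch count gives a rough per-epoch cost (a sketch using the mean values copied from above; early stopping makes this approximate):

```python
# Mean epoch counts and e2e training times (s), copied from the benchmark table.
models = {
    "deimv2_s": {"epochs": 56.1463, "e2e_time": 966.5823},
    "deimv2_l": {"epochs": 46.7805, "e2e_time": 903.9662},
}
for name, m in models.items():
    m["sec_per_epoch"] = m["e2e_time"] / m["epochs"]
    print(f"{name}: {m['sec_per_epoch']:.2f} s/epoch over {m['epochs']:.1f} epochs")

# L is slower per epoch (~19.3 s vs ~17.2 s) but early-stops sooner,
# so its end-to-end time comes out lower.
assert models["deimv2_l"]["sec_per_epoch"] > models["deimv2_s"]["sec_per_epoch"]
assert models["deimv2_l"]["e2e_time"] < models["deimv2_s"]["e2e_time"]
```

So L pays more per epoch but stops roughly ten epochs earlier, which is what the e2e chart reflects.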

kprokofi · Dec 10 '25 17:12

> DeimV2-S, DeimV2-M, and DeimV2-L are the recommended DETR models to expose in Geti Tune.

I agree with this proposal: the S, M, and L models show a good trade-off between F1 and speed, while the X model is slower than L without improving accuracy. Please go ahead with creating the manifests :)
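The "X is dominated by L" claim can be read off the table's torch test F1 and latency columns (a quick sketch with values copied from above, not part of the benchmark tooling):

```python
# (mean torch:test/f1-score, mean torch:test/latency in s), from the table above.
results = {
    "deimv2_s": (0.5650, 0.0598),
    "deimv2_m": (0.5868, 0.0670),
    "deimv2_l": (0.6043, 0.0757),
    "deimv2_x": (0.6038, 0.0899),
}
f1_l, lat_l = results["deimv2_l"]
f1_x, lat_x = results["deimv2_x"]
# X is roughly 19% slower than L at inference while scoring slightly lower F1,
# so it sits behind L on the accuracy/speed frontier.
assert lat_x > lat_l and f1_x < f1_l
# S, M, L form a clean frontier: each step up trades latency for F1.
assert results["deimv2_s"] < results["deimv2_m"] < results["deimv2_l"]
```

This supports exposing only S, M, and L: every model on the recommended list buys extra F1 for its extra latency, and X does not.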

leoll2 · Dec 11 '25 09:12