
DEIM: TensorRT engine export with dynamic batches

Open mitzy-ByteMe opened this issue 11 months ago • 1 comments

Has anyone successfully exported DEIM to TensorRT or ONNX with dynamic batch sizes? While export_onnx.py and trtexec work for exporting the model, I get an error related to the model architecture ('/model/decoder/GatherElements') during batch inference with both the ONNX and TensorRT engine files. I used the following trtexec command for the export:

    trtexec --onnx=model.onnx --saveEngine=model.trt --minShapes=images:1x3x640x640,orig_target_sizes:1x2 --optShapes=images:1x3x640x640,orig_target_sizes:1x2 --maxShapes=images:32x3x640x640,orig_target_sizes:32x2 --fp16

My input shapes are correct (e.g., for a batch size of 2: images: torch.Size([2, 3, 640, 640]), orig_target_sizes: torch.Size([2, 2])).
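For reference, the batched inputs described above can be built like this. This is a minimal NumPy sketch under the assumption that each row of orig_target_sizes holds the original size of one image; the exact order (width/height vs. height/width) depends on DEIM's postprocessor and is not confirmed here.

```python
import numpy as np

batch = 2

# Batched image tensor: (N, 3, 640, 640), float32
images = np.zeros((batch, 3, 640, 640), dtype=np.float32)

# One (size, size) row per image; 640x640 assumed here for illustration
orig_target_sizes = np.tile(np.array([[640, 640]], dtype=np.int64), (batch, 1))

print(images.shape)             # (2, 3, 640, 640)
print(orig_target_sizes.shape)  # (2, 2)
```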

This is the error with ONNX Runtime:

    2025-03-21 09:46:45.304331171 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running GatherElements node. Name:'/model/decoder/GatherElements' Status Message: GatherElements op: 'indices' shape should have values within bounds of 'data' shape. Invalid value in indices shape is: 2

This is the error with TensorRT:

    [03/21/2025-09:16:43] [TRT] [E] IExecutionContext::executeV2: Error Code 7: Internal Error (/model/decoder/GatherElements: The extent of dimension 0 of indices must be less than or equal to the extent of data. Condition '<' violated: 2 >= 1. Instruction: CHECK_LESS 2 1.)
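Both errors are the same bound check: GatherElements requires every dimension of `indices` (except the gather axis) to fit within `data`, and here an indices tensor shaped for one batch size meets a data tensor shaped for another, suggesting some shape in the decoder was baked in at export time rather than left dynamic. A minimal NumPy sketch of that check (an approximation of the ONNX GatherElements semantics, not the DEIM decoder itself) reproduces the failure mode:

```python
import numpy as np

def gather_elements(data, indices, axis=0):
    # Approximate ONNX GatherElements: same rank required, and every
    # non-axis extent of `indices` must fit within `data` -- this is
    # the bound the runtime error messages above are reporting.
    for d in range(data.ndim):
        if d != axis and indices.shape[d] > data.shape[d]:
            raise ValueError(
                f"indices extent {indices.shape[d]} exceeds data extent "
                f"{data.shape[d]} in dimension {d}"
            )
    return np.take_along_axis(data, indices, axis=axis)

data = np.arange(12, dtype=np.float32).reshape(2, 6)  # batch extent 2
idx = np.zeros((2, 3), dtype=np.int64)                # batch extent 2

out = gather_elements(data, idx, axis=1)              # OK, shape (2, 3)

# Mismatched batch extents (1 vs 2) fail exactly like the exported model:
bad_data = data[:1]                                   # batch extent 1
try:
    gather_elements(bad_data, idx, axis=1)
except ValueError as e:
    print(e)
```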

mitzy-ByteMe avatar Mar 21 '25 09:03 mitzy-ByteMe

  • My Fork
    • The wholebody28 branch of this fork only customizes some parameters for training Wholebody28, and adds ONNX/TensorRT optimizations. https://github.com/PINTO0309/DEIM
  • For dynamic model custom
    • https://github.com/PINTO0309/DEIM/commit/06f61981239fdc36152a25ceb0e5082159cc10f3
    • https://github.com/PINTO0309/DEIM/commit/1913781500ad5afdcce7cbd8b8b85781aeeaef84
  • ONNX files
    • https://github.com/PINTO0309/DEIM/releases/tag/onnx
  • DEIM dynamic batch (dynamic height, dynamic width) - ONNX
  • Inference test - [5, 3, 480, 640]
    • CUDA
      sit4onnx -if deim_hgnetv2_s_wholebody28_ft_1250query_n_batch.onnx -oep cuda -fs 5 3 480 640
      
      INFO: file: deim_hgnetv2_s_wholebody28_ft_1250query_n_batch.onnx
      INFO: providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
      INFO: input_name.1: input_bgr shape: [5, 3, 480, 640] dtype: float32
      INFO: test_loop_count: 10
      INFO: total elapsed time:  558.7770938873291 ms
      INFO: avg elapsed time per pred:  55.87770938873291 ms
      INFO: output_name.1: label_xyxy_score shape: [5, 1250, 6] dtype: float32
      
    • TensorRT
      sit4onnx -if deim_hgnetv2_s_wholebody28_ft_1250query_n_batch.onnx -oep tensorrt -fs 5 3 480 640
      
      2025-03-22 15:36:32.511025557 [W:onnxruntime:Default, tensorrt_execution_provider.h:86 log] [2025-03-22 06:36:32 WARNING] ModelImporter.cpp:787: Make sure output /model/decoder/decoder/lqe_layers.2/TopK_output_1 has Int64 binding.
      2025-03-22 15:36:32.580633082 [W:onnxruntime:Default, tensorrt_execution_provider.h:86 log] [2025-03-22 06:36:32 WARNING] ModelImporter.cpp:787: Make sure output /model/decoder/decoder/lqe_layers.2/TopK_output_1 has Int64 binding.
      INFO: file: deim_hgnetv2_s_wholebody28_ft_1250query_n_batch.onnx
      INFO: providers: ['TensorrtExecutionProvider', 'CPUExecutionProvider']
      INFO: input_name.1: input_bgr shape: [5, 3, 480, 640] dtype: float32
      INFO: test_loop_count: 10
      INFO: total elapsed time:  154.9851894378662 ms
      INFO: avg elapsed time per pred:  15.498518943786621 ms
      INFO: output_name.1: label_xyxy_score shape: [5, 1250, 6] dtype: float32
      
  • My playground
    • https://github.com/PINTO0309/PINTO_model_zoo/tree/main/465_DEIM-Wholebody28

Good luck.

PINTO0309 avatar Mar 22 '25 06:03 PINTO0309