TRT inference: poor performance vs. PyTorch with DINO model
Trained model: DINO (link)
First, I use mmdeploy to convert the PyTorch model to ONNX format.
Second, I use the TRT builder to generate the engine.
Finally, I use the execute_async_v2 method for inference, but the resulting performance is much worse than PyTorch.
The Nsight profile is below: the forward time is about 420 ms+, but the PyTorch inference time is about 180 ms. The nsys files are attached below.
My questions: what is the problem, and how can I further analyze and optimize the performance?
BTW, my TRT inference code is below, please check. Thanks.
from PIL import Image
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context; required before any pycuda allocations
import tensorrt as trt
import cv2
import ctypes

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def allocate_buffers(engine):
    # Page-locked host buffers plus device buffers for 1 input and 5 outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output1 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    h_output2 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(2)), dtype=trt.nptype(trt.float32))
    h_output3 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(3)), dtype=trt.nptype(trt.float32))
    h_output4 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(4)), dtype=trt.nptype(trt.float32))
    h_output5 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(5)), dtype=trt.nptype(trt.float32))
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output1 = cuda.mem_alloc(h_output1.nbytes)
    d_output2 = cuda.mem_alloc(h_output2.nbytes)
    d_output3 = cuda.mem_alloc(h_output3.nbytes)
    d_output4 = cuda.mem_alloc(h_output4.nbytes)
    d_output5 = cuda.mem_alloc(h_output5.nbytes)
    stream = cuda.Stream()
    return (h_input, d_input, h_output1, d_output1, h_output2, d_output2,
            h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream)

def do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2,
                 h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream):
    # H2D copy, enqueue the engine, D2H copies, then wait for the stream to finish.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2(
        bindings=[int(d_input), int(d_output1), int(d_output2), int(d_output3), int(d_output4), int(d_output5)],
        stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output1, d_output1, stream)
    cuda.memcpy_dtoh_async(h_output2, d_output2, stream)
    cuda.memcpy_dtoh_async(h_output3, d_output3, stream)
    cuda.memcpy_dtoh_async(h_output4, d_output4, stream)
    cuda.memcpy_dtoh_async(h_output5, d_output5, stream)
    stream.synchronize()

def load_normalized_test_case(test_image, pagelocked_buffer):
    def normalize_image(image):
        # Resize, BGR -> RGB, HWC -> CHW, scale to [0, 1], then flatten into the page-locked buffer.
        img_src = cv2.imread(image)
        resized = cv2.resize(img_src, (750, 1333), interpolation=cv2.INTER_LINEAR)
        img_in = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        img_in = np.transpose(img_in, (2, 0, 1)).astype(np.float32)
        img_in = np.expand_dims(img_in, axis=0)
        img_in /= 255.0
        return img_in.flatten()
    np.copyto(pagelocked_buffer, normalize_image(test_image))

def load_engine(engine_path):
    with open(engine_path, 'rb') as f:
        runtime = trt.Runtime(TRT_LOGGER)
        runtime.max_threads = 10
        engine_data = f.read()
        return runtime.deserialize_cuda_engine(engine_data)

def build_engine():
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30
        config.set_flag(trt.BuilderFlag.FP16)
        with open("./end2end.onnx", 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
        engine = builder.build_engine(network, config)
        engine_file = "./end2end.engine"
        if engine_file:
            with open(engine_file, 'wb') as f:
                f.write(engine.serialize())
        return engine

def main():
    test_image = "./1.jpg"
    # build_engine()
    with load_engine("./end2end.engine") as engine:
        (h_input, d_input, h_output1, d_output1, h_output2, d_output2,
         h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream) = allocate_buffers(engine)
        import torch.cuda.nvtx as nvtx
        nvtx.range_push("prepare Data")
        load_normalized_test_case(test_image, h_input)
        nvtx.range_pop()
        with engine.create_execution_context() as context:
            for i in range(100):
                nvtx.range_push("Forward")
                do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2,
                             h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream)
                nvtx.range_pop()

if __name__ == '__main__':
    # Load the mmdeploy custom-op plugin library before deserializing the engine.
    lib_path = "./libmmdeploy_tensorrt_ops.so"
    ctypes.CDLL(lib_path)
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")
    main()
- Can you share the onnx and plugin.so here for a quick reproduce?
- Which GPU are you using? Also the NVIDIA driver version, CUDA version, etc. Please provide these following our bug template.
If possible, please use trtexec to benchmark the TRT performance; a sample command would be: trtexec --onnx=model.onnx --plugins=./libmmdeploy_tensorrt_ops.so --fp16
@zerollzeng
- I uploaded my onnx file and plugin.so; you can download them from https://drive.google.com/file/d/11woAWMIUNf3VYO2-hdZtmIk7udQ_lA18/view?usp=drive_link
- I am using an A100 with CUDA 12.2.
Thanks.
I've requested access.
Checked with TRT 8.6 (TRT docker 23.10) on A100: the mean GPU time is 102.96 ms, so this doesn't look like a bug in TRT.
[11/14/2023-11:22:42] [I] H2D Latency: min = 0.717041 ms, max = 0.864014 ms, mean = 0.803251 ms, median = 0.814941 ms, percentile(90%) = 0.8479 ms, percentile(95%) = 0.861694 ms, percentile(99%) = 0.864014 ms
[11/14/2023-11:22:42] [I] GPU Compute Time: min = 102.208 ms, max = 103.756 ms, mean = 102.96 ms, median = 102.987 ms, percentile(90%) = 103.37 ms, percentile(95%) = 103.458 ms, percentile(99%) = 103.756 ms
[11/14/2023-11:22:42] [I] D2H Latency: min = 0.0119629 ms, max = 0.0164185 ms, mean = 0.0146327 ms, median = 0.0146484 ms, percentile(90%) = 0.0161133 ms, percentile(95%) = 0.0163574 ms, percentile(99%) = 0.0164185 ms
...
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=end2end.onnx --plugins=./libmmdeploy_tensorrt_ops_cuda12.so --dumpProfile --separateProfileRun
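trtexec's --dumpProfile output has a Python-side counterpart: a trt.IProfiler can be attached to the execution context to get per-layer timings and see which layers (or plugin ops) dominate. A minimal sketch follows, assuming the same context and bindings as in the script above; LayerTimer is a hypothetical helper, not part of the original code.

import tensorrt as trt

class LayerTimer(trt.IProfiler):
    """Accumulate per-layer execution time reported by TensorRT."""
    def __init__(self):
        super().__init__()
        self.times = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer per execution.
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

# Usage (hypothetical: `context` and `bindings` come from the script above):
# context.profiler = LayerTimer()
# context.execute_v2(bindings)  # execute_v2 is synchronous, which is simplest for profiling
# for name, ms in sorted(context.profiler.times.items(), key=lambda kv: -kv[1])[:10]:
#     print(f"{ms:8.3f} ms  {name}")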
But the performance of FP16 and TF32 is basically the same. Is this normal? It doesn't seem to meet expectations. @zerollzeng
Hi @chenrui17, did you figure it out by any chance? I'm running into the same problem, although this is an old issue.
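One way to check whether FP16 is actually taking effect is to build two engines from the same ONNX, one with the FP16 flag and one without (TF32 is enabled by default on Ampere), and compare their timings and sizes. A minimal sketch, similar to build_engine() above, is below; build_with_precision is a hypothetical helper, and the mmdeploy plugin must already be loaded (as in the script's __main__ block) before parsing. If most of the time is spent in plugin layers or memory-bound ops, FP16 and TF32 can indeed end up close.

import tensorrt as trt

def build_with_precision(onnx_path, fp16):
    # Build an engine with either FP16 enabled or the default TF32-only precision.
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    return builder.build_engine(network, config)

# engine_fp16 = build_with_precision("./end2end.onnx", fp16=True)
# engine_tf32 = build_with_precision("./end2end.onnx", fp16=False)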