TRT inference: poor performance vs. PyTorch with DINO model
Trained model: DINO (link)
First, I use mmdeploy to convert the PyTorch model to ONNX format.
Second, I use the TRT builder to generate the engine.
Finally, I use the execute_async_v2 method for inference, but the resulting performance is much worse than PyTorch.
The Nsight profile is below: the forward time is about 420 ms+, but the PyTorch inference time is about 180 ms. The nsys files are attached below.
My questions: what is the problem, and how can I further analyze and optimize the performance?
BTW, my TRT inference code is below, please check. Thanks.
from PIL import Image
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context; required before any pycuda allocations
import tensorrt as trt
import cv2
import ctypes

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def allocate_buffers(engine):
    # Page-locked host buffers plus device buffers for 1 input and 5 outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(trt.float32))
    h_output1 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(trt.float32))
    h_output2 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(2)), dtype=trt.nptype(trt.float32))
    h_output3 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(3)), dtype=trt.nptype(trt.float32))
    h_output4 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(4)), dtype=trt.nptype(trt.float32))
    h_output5 = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(5)), dtype=trt.nptype(trt.float32))
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output1 = cuda.mem_alloc(h_output1.nbytes)
    d_output2 = cuda.mem_alloc(h_output2.nbytes)
    d_output3 = cuda.mem_alloc(h_output3.nbytes)
    d_output4 = cuda.mem_alloc(h_output4.nbytes)
    d_output5 = cuda.mem_alloc(h_output5.nbytes)
    stream = cuda.Stream()
    return (h_input, d_input, h_output1, d_output1, h_output2, d_output2,
            h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream)

def do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2,
                 h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream):
    # H2D copy, enqueue the engine, D2H copies, then wait for the stream to finish.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2(
        bindings=[int(d_input), int(d_output1), int(d_output2), int(d_output3), int(d_output4), int(d_output5)],
        stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output1, d_output1, stream)
    cuda.memcpy_dtoh_async(h_output2, d_output2, stream)
    cuda.memcpy_dtoh_async(h_output3, d_output3, stream)
    cuda.memcpy_dtoh_async(h_output4, d_output4, stream)
    cuda.memcpy_dtoh_async(h_output5, d_output5, stream)
    stream.synchronize()

def load_normalized_test_case(test_image, pagelocked_buffer):
    def normalize_image(image):
        # Resize, BGR -> RGB, HWC -> CHW, scale to [0, 1], then flatten into the page-locked buffer.
        img_src = cv2.imread(image)
        resized = cv2.resize(img_src, (750, 1333), interpolation=cv2.INTER_LINEAR)
        img_in = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        img_in = np.transpose(img_in, (2, 0, 1)).astype(np.float32)
        img_in = np.expand_dims(img_in, axis=0)
        img_in /= 255.0
        return img_in.flatten()
    np.copyto(pagelocked_buffer, normalize_image(test_image))

def load_engine(engine_path):
    with open(engine_path, 'rb') as f:
        runtime = trt.Runtime(TRT_LOGGER)
        runtime.max_threads = 10
        engine_data = f.read()
        return runtime.deserialize_cuda_engine(engine_data)

def build_engine():
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:
        config = builder.create_builder_config()
        config.max_workspace_size = 1 << 30
        config.set_flag(trt.BuilderFlag.FP16)
        with open("./end2end.onnx", 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
        engine = builder.build_engine(network, config)
        engine_file = "./end2end.engine"
        if engine_file:
            with open(engine_file, 'wb') as f:
                f.write(engine.serialize())
        return engine

def main():
    test_image = "./1.jpg"
    # build_engine()
    with load_engine("./end2end.engine") as engine:
        (h_input, d_input, h_output1, d_output1, h_output2, d_output2,
         h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream) = allocate_buffers(engine)
        import torch.cuda.nvtx as nvtx
        nvtx.range_push("prepare Data")
        load_normalized_test_case(test_image, h_input)
        nvtx.range_pop()
        with engine.create_execution_context() as context:
            for i in range(100):
                nvtx.range_push("Forward")
                do_inference(context, h_input, d_input, h_output1, d_output1, h_output2, d_output2,
                             h_output3, d_output3, h_output4, d_output4, h_output5, d_output5, stream)
                nvtx.range_pop()

if __name__ == '__main__':
    # Load the mmdeploy custom-op plugin library before deserializing the engine.
    lib_path = "./libmmdeploy_tensorrt_ops.so"
    ctypes.CDLL(lib_path)
    trt.init_libnvinfer_plugins(TRT_LOGGER, "")
    main()
- Can you share the onnx and plugin.so here for a quick reproduce?
- Which GPU are you using? Also the NVIDIA driver version, CUDA version, etc. Please provide these following our bug template.
If possible, please use trtexec to benchmark the TRT performance; a sample command would be: trtexec --onnx=model.onnx --plugins=./libmmdeploy_tensorrt_ops.so --fp16
@zerollzeng
- I uploaded my onnx file and plugin.so; you can download them from https://drive.google.com/file/d/11woAWMIUNf3VYO2-hdZtmIk7udQ_lA18/view?usp=drive_link
- I am using an A100 with CUDA 12.2.
Thanks.
I've requested access.
Checked with TRT 8.6 (TRT docker 23.10) on A100: the mean GPU time is 102.96 ms, so this doesn't look like a bug in TRT.
[11/14/2023-11:22:42] [I] H2D Latency: min = 0.717041 ms, max = 0.864014 ms, mean = 0.803251 ms, median = 0.814941 ms, percentile(90%) = 0.8479 ms, percentile(95%) = 0.861694 ms, percentile(99%) = 0.864014 ms
[11/14/2023-11:22:42] [I] GPU Compute Time: min = 102.208 ms, max = 103.756 ms, mean = 102.96 ms, median = 102.987 ms, percentile(90%) = 103.37 ms, percentile(95%) = 103.458 ms, percentile(99%) = 103.756 ms
[11/14/2023-11:22:42] [I] D2H Latency: min = 0.0119629 ms, max = 0.0164185 ms, mean = 0.0146327 ms, median = 0.0146484 ms, percentile(90%) = 0.0161133 ms, percentile(95%) = 0.0163574 ms, percentile(99%) = 0.0164185 ms
...
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=end2end.onnx --plugins=./libmmdeploy_tensorrt_ops_cuda12.so --dumpProfile --separateProfileRun
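trtexec's --dumpProfile output has a Python-side counterpart: a trt.IProfiler can be attached to the execution context to get per-layer timings and see which layers (or plugin ops) dominate. A minimal sketch follows, assuming the same context and bindings as in the script above; LayerTimer is a hypothetical helper, not part of the original code.

import tensorrt as trt

class LayerTimer(trt.IProfiler):
    """Accumulate per-layer execution time reported by TensorRT."""
    def __init__(self):
        super().__init__()
        self.times = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer per execution.
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

# Usage (hypothetical: `context` and `bindings` come from the script above):
# context.profiler = LayerTimer()
# context.execute_v2(bindings)  # execute_v2 is synchronous, which is simplest for profiling
# for name, ms in sorted(context.profiler.times.items(), key=lambda kv: -kv[1])[:10]:
#     print(f"{ms:8.3f} ms  {name}")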
But the performance of FP16 and TF32 is basically the same. Is this normal? It doesn't seem to meet expectations. @zerollzeng
Hi @chenrui17, did you figure it out by any chance? I'm running into the same problem, although this is an old issue.
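One way to check whether FP16 is actually taking effect is to build two engines from the same ONNX, one with the FP16 flag and one without (TF32 is enabled by default on Ampere), and compare their timings and sizes. A minimal sketch, similar to build_engine() above, is below; build_with_precision is a hypothetical helper, and the mmdeploy plugin must already be loaded (as in the script's __main__ block) before parsing. If most of the time is spent in plugin layers or memory-bound ops, FP16 and TF32 can indeed end up close.

import tensorrt as trt

def build_with_precision(onnx_path, fp16):
    # Build an engine with either FP16 enabled or the default TF32-only precision.
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    return builder.build_engine(network, config)

# engine_fp16 = build_with_precision("./end2end.onnx", fp16=True)
# engine_tf32 = build_with_precision("./end2end.onnx", fp16=False)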