
TensorRT 8.6.1.6 inference costs too much time

Open kaixiangjin opened this issue 1 year ago • 16 comments

Description

I used TensorRT 8.6.1.6 to implement YOLOv8 inference and found a confusing problem: as I increase the batch size from 1 to 12, the inference time increases proportionally, e.g. batch size 1: 10 ms; batch size 2: 20 ms; ...; batch size 12: 120 ms. It seems the model runs inference on images one by one, not as a whole batch. Is this normal? In my view, if batch size 2 costs 20 ms, then batch size 4 should also cost about 20 ms, since CUDA should process the images in parallel. I do not know how to solve this problem. Could someone give me a demo to help me implement this idea?
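
To make the expectation concrete, here is a tiny sketch (plain Python over the numbers reported above) of what linear scaling implies: per-image cost and throughput stay flat, so batching buys nothing.

measured_ms = {1: 10, 2: 20, 12: 120}  # batch size -> observed total latency (ms)
for bs, ms in measured_ms.items():
    print(f"batch {bs:2d}: {ms:5.1f} ms total, {ms / bs:4.1f} ms/image, "
          f"{bs / ms * 1000:5.1f} images/s")
# every row prints 10.0 ms/image and 100.0 images/s -> no batching benefit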

Environment

TensorRT Version: 8.6.1.6

NVIDIA GPU: RTX A4000

NVIDIA Driver Version:

CUDA Version: 11.6

CUDNN Version:

Operating System: windows

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

kaixiangjin avatar Jul 09 '24 07:07 kaixiangjin

It seems the model runs inference on images one by one, not as a whole batch.

Parallel processing only happens when the GPU still has spare resources; otherwise execution is effectively serial.

lix19937 avatar Jul 10 '24 01:07 lix19937

It seems the model runs inference on images one by one, not as a whole batch.

Parallel processing only happens when the GPU still has spare resources; otherwise execution is effectively serial.

How do I know if the GPU resources are insufficient? Can I compute it?

kaixiangjin avatar Jul 10 '24 02:07 kaixiangjin

GPU resources include many things: registers, L1/L2 cache, memory bandwidth, shared memory, CUDA cores/Tensor Cores, etc. You usually need to run experiments.

You can get a rough view of GPU utilization through nvidia-smi.
On the other hand, a model has many layers (each layer runs some CUDA kernels), so execution across layers can also be parallel.
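
For rough intuition, here is a back-of-envelope sketch (the convolution size is a made-up example; only the A4000 SM figures are real hardware limits) of why a saturated GPU serializes batches:

# Hypothetical conv layer for ONE image: 160x160 outputs, 64 channels,
# one thread per output element (illustrative numbers, not YOLOv8's real kernels).
threads_one_image = 160 * 160 * 64

sms = 48                   # RTX A4000 (Ampere GA104) has 48 SMs
max_threads_per_sm = 1536  # resident-thread limit per SM at compute capability 8.6
gpu_capacity = sms * max_threads_per_sm

waves = threads_one_image / gpu_capacity
print(f"one image needs ~{waves:.0f} waves of threads -> the GPU is already full,")
print("so a second image in the batch adds more waves (more time), not parallelism")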

lix19937 avatar Jul 10 '24 05:07 lix19937

GPU resources include many things: registers, L1/L2 cache, memory bandwidth, shared memory, CUDA cores/Tensor Cores, etc. You usually need to run experiments.

You can get a rough view of GPU utilization through nvidia-smi. On the other hand, a model has many layers (each layer runs some CUDA kernels), so execution across layers can also be parallel.

I checked my model and GPU, and I think the GPU has enough resources: it is an RTX A4000 and the model is YOLOv8s. Even with 224x224 as the input size, the phenomenon still exists.

kaixiangjin avatar Jul 10 '24 05:07 kaixiangjin

What is your benchmark command or code?

lix19937 avatar Jul 10 '24 15:07 lix19937

I had the same problem: the inference time for a batch size of 32 is about 32x larger than for a batch size of 1, but the same model under TensorFlow-TensorRT behaves as expected. The hardware and environment are the same, inside an NVIDIA TensorFlow container (release 24.01). Here is the benchmark command:

trtexec --onnx=./tmp.onnx --saveEngine=./tmp.trt --shapes='input1':32x256x256x1,input2:32x256x256x1

xxHn-pro avatar Jul 11 '24 09:07 xxHn-pro

@xxHn-pro How do you measure the time?

lix19937 avatar Jul 11 '24 11:07 lix19937

In TensorRT, it is in the log output. I take "GPU Compute Time" as the inference time.

[07/11/2024-09:44:35] [I] === Performance summary ===
[07/11/2024-09:44:35] [I] Throughput: 15.4965 qps
[07/11/2024-09:44:35] [I] Latency: min = 63.6447 ms, max = 65.3091 ms, mean = 64.3707 ms, median = 64.2803 ms, percentile(90%) = 65.0261 ms, percentile(95%) = 65.238 ms, percentile(99%) = 65.3091 ms
[07/11/2024-09:44:35] [I] Enqueue Time: min = 0.492401 ms, max = 0.9552 ms, mean = 0.859185 ms, median = 0.863281 ms, percentile(90%) = 0.917953 ms, percentile(95%) = 0.927368 ms, percentile(99%) = 0.9552 ms
[07/11/2024-09:44:35] [I] H2D Latency: min = 1.3772 ms, max = 1.38623 ms, mean = 1.37917 ms, median = 1.37891 ms, percentile(90%) = 1.38007 ms, percentile(95%) = 1.38232 ms, percentile(99%) = 1.38623 ms
[07/11/2024-09:44:35] [I] GPU Compute Time: min = 59.7156 ms, max = 61.3806 ms, mean = 60.4414 ms, median = 60.3503 ms, percentile(90%) = 61.0979 ms, percentile(95%) = 61.3088 ms, percentile(99%) = 61.3806 ms
[07/11/2024-09:44:35] [I] D2H Latency: min = 2.54977 ms, max = 2.55176 ms, mean = 2.55008 ms, median = 2.55005 ms, percentile(90%) = 2.55029 ms, percentile(95%) = 2.55054 ms, percentile(99%) = 2.55176 ms
[07/11/2024-09:44:35] [I] Total Host Walltime: 3.03294 s
[07/11/2024-09:44:35] [I] Total GPU Compute Time: 2.84075 s

In TensorFlow-TensorRT, the code runs in Python and the inference time is measured as below.

import tensorflow as tf
from tensorflow.python.saved_model import signature_constants, tag_constants
import time

def LoadRT(saved_model_dir):
   saved_model_loaded = tf.saved_model.load(
       saved_model_dir, tags=[tag_constants.SERVING])
   graph_func = saved_model_loaded.signatures[
       signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
   return graph_func, saved_model_loaded

model, _ = LoadRT(ModelName)  # ModelName: path to the TF-TRT converted SavedModel

start_time = time.time()
pred = model(**InputData)          # InputData: dict keyed by the signature's input names
TimeIt = time.time() - start_time  # wall-clock seconds for a single call
print(pred, TimeIt)
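
A slightly more careful variant (a sketch reusing the same assumed ModelName and InputData): the first TF-TRT call can include tracing and engine build, so warm up and average over many calls; fetching an output back to host makes sure the GPU work has finished before the clock stops.

for _ in range(10):          # warm-up: tracing / TRT engine build happen here
    model(**InputData)

runs = 100
start_time = time.time()
for _ in range(runs):
    pred = model(**InputData)
_ = [t.numpy() for t in pred.values()]  # sync: pull outputs back to host
mean_ms = (time.time() - start_time) / runs * 1000
print(f"mean latency: {mean_ms:.2f} ms")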

xxHn-pro avatar Jul 11 '24 12:07 xxHn-pro

I reproduced the problem with an open model from here. Here is the result: the time scales by about 1.7x each time the batch size doubles. Is that normal? I believe the hardware (an A100) is strong enough to handle these batch sizes in parallel.

Batch size   4       8       16      32      64
Time (ms)    1.41    2.27    3.84    7.11    13.40
Scale        -       1.6099  1.6916  1.8516  1.8847

Here is the info about the container:

================

== TensorFlow ==

NVIDIA Release 24.01-tf2 (build 78846615) TensorFlow Version 2.14.0

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. Copyright 2017-2023 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.60.13. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not detected. Multi-node communication performance may be reduced.

The test was done with

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:4x3x224x224  > log4.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:8x3x224x224  > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:16x3x224x224  > log16.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:32x3x224x224  > log32.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:64x3x224x224  > log64.txt
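
As a convenience, a small sketch (assuming the log4.txt ... log64.txt files written above) to pull the mean "GPU Compute Time" out of each trtexec summary instead of reading the logs by hand:

import re

for bs in (4, 8, 16, 32, 64):
    with open(f"log{bs}.txt") as f:
        # matches trtexec's "GPU Compute Time: min = ..., max = ..., mean = X ms" line
        m = re.search(r"GPU Compute Time:.*?mean = ([\d.]+) ms", f.read())
    if m:
        print(f"batch {bs:2d}: mean GPU compute time = {m.group(1)} ms")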

The full log for batch size 32 is attached: log32.txt

Any advice or suggestion will be appreciated.

xxHn-pro avatar Jul 12 '24 09:07 xxHn-pro

@lix19937 Can you suggest something to try, or comment on the results, please?

xxHn-pro avatar Jul 17 '24 08:07 xxHn-pro

@xxHn-pro A dynamic-shape model needs min/opt/max shapes set:

  --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
  --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
  --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided

lix19937 avatar Jul 20 '24 03:07 lix19937

I have tried these commands.

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:8x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:16x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:16x3x224x224 > log16.txt

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:4x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:16x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:32x3x224x224 > log16.txt

But the results are the same as before.

xxHn-pro avatar Jul 21 '24 09:07 xxHn-pro

Can you upload the resnet50-v2-7.onnx file?

lix19937 avatar Jul 21 '24 11:07 lix19937

The onnx file can be obtained from https://github.com/onnx/models/blob/main/validated/vision/classification/resnet/model/resnet50-v2-7.onnx

xxHn-pro avatar Jul 22 '24 02:07 xxHn-pro

  • My env is as follows:

=== Device Information ===
Selected Device: NVIDIA RTX 2000 Ada Generation Laptop GPU
Compute Capability: 8.9
SMs: 24
Compute Clock Rate: 2.115 GHz
Device Global Memory: 8187 MiB
Shared Memory per SM: 100 KiB
Memory Bus Width: 128 bits (ECC disabled)
Memory Clock Rate: 8.001 GHz

Ubuntu: 20.04 x86_64
TensorRT: 8510

Make sure that there are no other tasks on the machine during compilation.

  • Benchmark script

#!/bin/bash

# Batch sizes to benchmark; ${bz} must be braced below so the shell does not
# try to expand a variable named "bzx3x224x224".
batch_sizes=(1 2 3 4 5)

for bz in "${batch_sizes[@]}"; do
    trtexec --onnx=./resnet50-v2-7.onnx \
        --saveEngine=./resnet50-v2-7_${bz}.plan \
        --minShapes=data:${bz}x3x224x224 \
        --optShapes=data:${bz}x3x224x224 \
        --maxShapes=data:${bz}x3x224x224 \
        --verbose --dumpProfile --noDataTransfers --useCudaGraph --useSpinWait --separateProfileRun
done
  • Metric summary

batch size   latency (ms)
1            1.78
2            2.65
3            3.35
4            4.16
5            4.89

Clearly, latency grows sub-linearly with batch size, so larger batches do improve efficiency.
For TensorRT, FP16 can also be enabled (--fp16) for better efficiency, for reference.
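
The sub-linear growth is easy to verify from the table (plain Python over the numbers above):

lat = {1: 1.78, 2: 2.65, 3: 3.35, 4: 4.16, 5: 4.89}  # batch size -> latency (ms)
for bs in (2, 3, 4, 5):
    print(f"batch {bs}: {lat[bs] / lat[1]:.2f}x batch-1 latency "
          f"(pure serial execution would be {bs}.00x)")
# 1.49x, 1.88x, 2.34x, 2.75x -> well under the serial bound, so batching pays off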

lix19937 avatar Jul 29 '24 12:07 lix19937

Thanks for the testing.

xxHn-pro avatar Aug 02 '24 07:08 xxHn-pro

@lix19937 I ran the command line above and got the following error. What's going on? And how do I check the inference time?

#!/bin/bash
./trtexec --onnx=./model.onnx \
    --saveEngine=./model_1.plan \
    --minShapes=data:1x3x256x256 \
    --optShapes=data:1x3x256x256 \
    --maxShapes=data:1x3x256x256 \
    --verbose --dumpProfile --noDataTransfers --useCudaGraph --useSpinWait --separateProfileRun > 01-run_onnx.log 2>&1

[12/10/2024-16:19:51] [V] [TRT] Registering layer: /model/Add for ONNX node: /model/Add
[12/10/2024-16:19:51] [V] [TRT] Registering tensor: output_133 for ONNX tensor: output
[12/10/2024-16:19:51] [V] [TRT] /model/Add [Add] outputs: [output -> (-1, 1, 256, 256)[FLOAT]],
[12/10/2024-16:19:51] [V] [TRT] Marking output_133 as output: output
[12/10/2024-16:19:51] [V] [TRT] Marking onnx::Mul_427_131 as output: onnx::Mul_427
[12/10/2024-16:19:51] [V] [TRT] Marking onnx::Mul_432_132 as output: onnx::Mul_432
[12/10/2024-16:19:51] [I] Finish parsing network model
[12/10/2024-16:19:51] [E] Cannot find input tensor with name "data" in the network inputs! Please make sure the input tensor names are correct.
[12/10/2024-16:19:51] [E] Network And Config setup failed
[12/10/2024-16:19:51] [E] Building engine failed
[12/10/2024-16:19:51] [E] Failed to create engine from model or file.
[12/10/2024-16:19:51] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8503] # ./trtexec --onnx=./model.onnx --saveEngine=./model_1.plan --minShapes=data:1x3x256x256 --optShapes=data:1x3x256x256 --maxShapes=data:1x3x256x256 --verbose --dumpProfile --noDataTransfers --useCudaGraph --useSpinWait --separateProfileRun
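
One way to catch this up front (a sketch using the onnx Python package, which is an assumption about your tooling) is to list the model's actual input names before writing the shape flags; the error above just means this model's input is not called "data":

import onnx

model = onnx.load("./model.onnx")
weights = {init.name for init in model.graph.initializer}
for inp in model.graph.input:
    if inp.name in weights:   # skip weights that also appear as graph inputs
        continue
    dims = [d.dim_value or d.dim_param for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)     # use this name as the key in --minShapes etc.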

watertianyi avatar Dec 10 '24 08:12 watertianyi

Please check the output of your command carefully. It is clear and easy to read.

xxHn-pro avatar Dec 11 '24 04:12 xxHn-pro

Thank you. The input name was indeed wrong; it works normally after changing "data" to "input".

watertianyi avatar Dec 11 '24 06:12 watertianyi

@kaixiangjin closing this ticket - please re-open if this is still reproducible on TensorRT 10.8, thanks.

brnguyen2 avatar Feb 11 '25 15:02 brnguyen2