
Question about INT8 quantization being slower

Open tonyskypc opened this issue 3 years ago • 7 comments

Description

yolov5s, quantized with pytorch-quantization (QAT)

Reference: https://github.com/maggiez0138/yolov5_quant_sample

onnx → FP16 engine: 3 ms
QAT → onnx → INT8 engine: 4 ms

Why is the INT8 engine slower than FP16? Please advise, thanks.

onnx file download

Environment

TensorRT Version: 8.2
NVIDIA GPU: GeForce RTX 3060 Ti
NVIDIA Driver Version: 510
CUDA Version: 11.4
CUDNN Version: 7.2
Operating System: Ubuntu 18.04
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 1.10
Baremetal or Container (if so, version):

tonyskypc avatar Jul 16 '22 06:07 tonyskypc

Your uploaded file only contains the QAT ONNX and the PyTorch .pt model. Can you also upload the non-QAT model so I can reproduce the comparison?

By the way, can you try TRT 8.4?

My result on an RTX 8000 with TRT 8.4:

[07/17/2022-05:35:19] [I] === Performance summary ===
[07/17/2022-05:35:19] [I] Throughput: 595.29 qps
[07/17/2022-05:35:19] [I] Latency: min = 2.72711 ms, max = 2.78296 ms, mean = 2.74745 ms, median = 2.74561 ms, percentile(99%) = 2.76794 ms
[07/17/2022-05:35:19] [I] Enqueue Time: min = 0.685791 ms, max = 1.16644 ms, mean = 0.778225 ms, median = 0.752197 ms, percentile(99%) = 0.994141 ms
[07/17/2022-05:35:19] [I] H2D Latency: min = 0.407837 ms, max = 0.438232 ms, mean = 0.413101 ms, median = 0.410645 ms, percentile(99%) = 0.429932 ms
[07/17/2022-05:35:19] [I] GPU Compute Time: min = 1.66138 ms, max = 1.69775 ms, mean = 1.67327 ms, median = 1.67261 ms, percentile(99%) = 1.69336 ms
[07/17/2022-05:35:19] [I] D2H Latency: min = 0.654297 ms, max = 0.672729 ms, mean = 0.661076 ms, median = 0.660889 ms, percentile(99%) = 0.666748 ms
[07/17/2022-05:35:19] [I] Total Host Walltime: 3.00526 s
[07/17/2022-05:35:19] [I] Total GPU Compute Time: 2.99349 s
[07/17/2022-05:35:19] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/17/2022-05:35:19] [V]
[07/17/2022-05:35:19] [V] === Explanations of the performance metrics ===
[07/17/2022-05:35:19] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/17/2022-05:35:19] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/17/2022-05:35:19] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/17/2022-05:35:19] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/17/2022-05:35:19] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/17/2022-05:35:19] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/17/2022-05:35:19] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/17/2022-05:35:19] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/17/2022-05:35:19] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # TensorRT-8.4.1.5/bin/trtexec --onnx=yolov5s-qat-best.onnx --int8 --verbose
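As a sanity check on the summary above: per the printed explanations, Latency is the sum of the H2D, GPU Compute, and D2H components, and Throughput is the query count divided by Total Host Walltime. A quick arithmetic sketch with the mean values copied from this run (not part of trtexec itself):

```python
# Mean values copied from the trtexec summary above.
h2d, compute, d2h = 0.413101, 1.67327, 0.661076  # ms
mean_latency = 2.74745                           # ms
throughput_qps = 595.29
walltime_s = 3.00526

# Latency = H2D + GPU Compute + D2H for a single query.
assert abs((h2d + compute + d2h) - mean_latency) < 1e-3

# Throughput ~ queries / walltime, so queries ~ throughput * walltime.
queries = throughput_qps * walltime_s
print(round(queries))  # roughly the number of queries trtexec ran

# 1 / (mean GPU Compute Time) bounds the achievable qps; here it lands
# close to the observed throughput, so the GPU is well utilized.
print(round(1000.0 / compute, 1))
```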

zerollzeng avatar Jul 17 '22 12:07 zerollzeng

yolov5s non-qat download

Thanks! @zerollzeng

tonyskypc avatar Jul 18 '22 06:07 tonyskypc

@zerollzeng Hi, I did as you said. TRT 8.4.1 results:

Non-QAT INT8:

[07/18/2022-16:17:27] [I] === Performance summary ===
[07/18/2022-16:17:27] [I] Throughput: 827.176 qps
[07/18/2022-16:17:27] [I] Latency: min = 2.11108 ms, max = 2.95532 ms, mean = 2.21508 ms, median = 2.20471 ms, percentile(99%) = 2.47363 ms
[07/18/2022-16:17:27] [I] Enqueue Time: min = 0.219482 ms, max = 1.52954 ms, mean = 0.525232 ms, median = 0.480347 ms, percentile(99%) = 1.25537 ms
[07/18/2022-16:17:27] [I] H2D Latency: min = 0.389404 ms, max = 0.74585 ms, mean = 0.434748 ms, median = 0.431061 ms, percentile(99%) = 0.585449 ms
[07/18/2022-16:17:27] [I] GPU Compute Time: min = 1.04541 ms, max = 1.88818 ms, mean = 1.0693 ms, median = 1.06393 ms, percentile(99%) = 1.2522 ms
[07/18/2022-16:17:27] [I] D2H Latency: min = 0.656982 ms, max = 0.900696 ms, mean = 0.711038 ms, median = 0.708191 ms, percentile(99%) = 0.812744 ms
[07/18/2022-16:17:27] [I] Total Host Walltime: 3.00299 s
[07/18/2022-16:17:27] [I] Total GPU Compute Time: 2.65614 s
[07/18/2022-16:17:27] [W] * GPU compute time is unstable, with coefficient of variance = 5.10254%.
[07/18/2022-16:17:27] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/18/2022-16:17:27] [I] Explanations of the performance metrics are printed in the verbose logs.
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/local/tensorrt/bin/trtexec --onnx=./runs/finetune/weights/yolov5s.onnx --verbose --int8 --saveEngine=./runs/finetune/weights/yolov5s.trt

Non-QAT FP16:

[07/18/2022-16:27:08] [I] === Performance summary ===
[07/18/2022-16:27:08] [I] Throughput: 728.59 qps
[07/18/2022-16:27:08] [I] Latency: min = 2.38107 ms, max = 3.37402 ms, mean = 2.42724 ms, median = 2.41078 ms, percentile(99%) = 2.75977 ms
[07/18/2022-16:27:08] [I] Enqueue Time: min = 0.290527 ms, max = 1.70911 ms, mean = 0.629934 ms, median = 0.604607 ms, percentile(99%) = 1.47644 ms
[07/18/2022-16:27:08] [I] H2D Latency: min = 0.380188 ms, max = 0.590088 ms, mean = 0.391329 ms, median = 0.389038 ms, percentile(99%) = 0.47229 ms
[07/18/2022-16:27:08] [I] GPU Compute Time: min = 1.32301 ms, max = 2.30298 ms, mean = 1.36393 ms, median = 1.35156 ms, percentile(99%) = 1.70996 ms
[07/18/2022-16:27:08] [I] D2H Latency: min = 0.658081 ms, max = 0.876221 ms, mean = 0.671984 ms, median = 0.666626 ms, percentile(99%) = 0.75708 ms
[07/18/2022-16:27:08] [I] Total Host Walltime: 3.00306 s
[07/18/2022-16:27:08] [I] Total GPU Compute Time: 2.98428 s
[07/18/2022-16:27:08] [W] * GPU compute time is unstable, with coefficient of variance = 5.64658%.
[07/18/2022-16:27:08] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/18/2022-16:27:08] [I] Explanations of the performance metrics are printed in the verbose logs.
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/local/tensorrt/bin/trtexec --onnx=./runs/finetune/weights/yolov5s.onnx --verbose --fp16 --saveEngine=./runs/finetune/weights/yolov5s.trt

QAT int8:

[07/18/2022-16:13:21] [I] === Performance summary ===
[07/18/2022-16:13:21] [I] Throughput: 559.371 qps
[07/18/2022-16:13:21] [I] Latency: min = 2.78833 ms, max = 3.84329 ms, mean = 2.85385 ms, median = 2.83472 ms, percentile(99%) = 3.34125 ms
[07/18/2022-16:13:21] [I] Enqueue Time: min = 0.391846 ms, max = 1.96265 ms, mean = 0.780991 ms, median = 0.73999 ms, percentile(99%) = 1.67163 ms
[07/18/2022-16:13:21] [I] H2D Latency: min = 0.383789 ms, max = 0.48999 ms, mean = 0.393992 ms, median = 0.393066 ms, percentile(99%) = 0.410309 ms
[07/18/2022-16:13:21] [I] GPU Compute Time: min = 1.74097 ms, max = 2.76685 ms, mean = 1.7834 ms, median = 1.7644 ms, percentile(99%) = 2.28766 ms
[07/18/2022-16:13:21] [I] D2H Latency: min = 0.655518 ms, max = 0.804138 ms, mean = 0.676455 ms, median = 0.675293 ms, percentile(99%) = 0.708923 ms
[07/18/2022-16:13:21] [I] Total Host Walltime: 3.00516 s
[07/18/2022-16:13:21] [I] Total GPU Compute Time: 2.9979 s
[07/18/2022-16:13:21] [W] * GPU compute time is unstable, with coefficient of variance = 5.16926%.
[07/18/2022-16:13:21] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/18/2022-16:13:21] [I] Explanations of the performance metrics are printed in the verbose logs.
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/local/tensorrt/bin/trtexec --onnx=./runs/finetune/weights/yolov5s-qat-best.onnx --verbose --int8 --saveEngine=./runs/finetune/weights/yolov5s-qat-best.trt
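Pulling the mean GPU Compute Time out of the three trtexec runs above makes the regression concrete (numbers copied from the logs; just a quick arithmetic sketch):

```python
# Mean GPU Compute Time (ms) from the three trtexec runs above.
fp16 = 1.36393      # non-QAT FP16
ptq_int8 = 1.0693   # non-QAT (PTQ) INT8
qat_int8 = 1.7834   # QAT INT8

print(f"PTQ INT8 speedup over FP16: {fp16 / ptq_int8:.2f}x")  # > 1: INT8 helps
print(f"QAT INT8 speedup over FP16: {fp16 / qat_int8:.2f}x")  # < 1: the QAT engine is slower
```

So PTQ INT8 behaves as expected on this GPU, while the QAT engine loses the INT8 advantage entirely, which points at how the Q/DQ nodes end up placed in the engine rather than at INT8 itself.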

tonyskypc avatar Jul 18 '22 08:07 tonyskypc

There is Engine Layer Information in your verbose log. You can check it to compare the final TRT engine structure and the precision chosen for each layer.

You can also try adding --profilingVerbosity=detailed --verbose --dumpProfile --dumpLayerInfo --separateProfileRun, which will print each layer's inference time.
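If reading the raw --dumpProfile console output is awkward, trtexec can also write the per-layer times to a file with --exportProfile=profile.json, which you can then sort to find hotspots (in a slow QAT engine these are often extra reformat/quantize layers). A minimal sketch, assuming the exported JSON is a list whose layer entries carry "name" and "timeMs" fields — the exact schema can vary between TensorRT versions, and the inline sample data below is made up:

```python
import json

# Made-up excerpt of a trtexec --exportProfile file; the field names are
# assumptions -- inspect the real file produced by your TensorRT version.
profile = json.loads("""
[
  {"count": 1789},
  {"name": "Conv_0 + Relu_1",  "timeMs": 310.2},
  {"name": "QuantizeLinear_5", "timeMs": 95.7},
  {"name": "Conv_7 + Add_8",   "timeMs": 250.1}
]
""")

# Keep only per-layer entries and rank them by total time.
layers = [e for e in profile if "name" in e]
for e in sorted(layers, key=lambda e: e["timeMs"], reverse=True):
    print(f'{e["timeMs"]:8.1f} ms  {e["name"]}')
```

Comparing such a ranking between the QAT and non-QAT engines shows exactly which layers got slower.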

zerollzeng avatar Jul 18 '22 09:07 zerollzeng

That is too advanced for me; I can't do it yet. Please look at https://github.com/maggiez0138/yolov5_quant_sample — is there an error in the QAT code? Please point out the specific spots. train_core_code and log download

Thank you!!

tonyskypc avatar Jul 18 '22 09:07 tonyskypc

I printed each layer's inference time (download), but I don't know how to interpret it or what to fix. Please help me, thanks. @zerollzeng

tonyskypc avatar Jul 19 '22 07:07 tonyskypc

I have the same issue. How can I solve it?

VictorGump avatar Aug 26 '22 12:08 VictorGump