Inserting QDQ has severely impacted the performance of the unquantized Myelin part.

Open zsh4614 opened this issue 1 year ago • 3 comments

Description

I am performing QAT quantization on a complex model. When I insert Q/DQ nodes into the ResNet portion that I want to quantize, following the placement rules, TensorRT runs that part in INT8 after building. How can I ensure that the parts without Q/DQ nodes still run with optimal performance in non-INT8 precision (FP16 + FP32)? I noticed that after inserting Q/DQ nodes into one part of the network, the unquantized parts run slower than they do in a pure FP16 build.

I ran an experiment in which I inserted Q/DQ only before a single convolution layer. The build result: [image]

The result of building the same network in FP16 mode: [image]

Why does the part within the green box perform differently?

Another question: even though the input and output of the Myelin node are exactly the same in the two exported engines, the execution time differs significantly. [image]

FP16 mode: [image]

I'm confused about how I can ensure that the unquantized parts of my model run optimally in FP16 or FP32.

Environment

TensorRT Version: 8.5.2

NVIDIA GPU: Orin / RTX 3090

CUDA Version: 11.4

zsh4614, Dec 23 '24 07:12

Can you upload the two logs produced with `trtexec --verbose --dumpProfile --dumpLayerInfo --separateProfileRun 2>&1 | tee log`?

lix19937, Dec 28 '24 15:12

> Can you upload the two logs with trtexec --verbose --dumpProfile --dumpLayerInfo --separateProfileRun 2>&1 | tee log

Before inserting Q/DQ: log_fp16.log

After inserting Q/DQ: log_qdq.log

zsh4614, Jan 02 '25 06:01

From your log, the QAT model has added some reformat copy nodes.
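One way to check this kind of difference yourself is to diff the per-layer profiles of the two engines. The sketch below is only an illustration, not something taken from the attached logs. It assumes each profile was exported as JSON with trtexec's `--exportProfile=<file>` option, and that each layer record carries `name` and `averageMs` fields (the exact layout may vary between TensorRT versions). The helper names `layer_times` and `compare_profiles` are hypothetical. Matching layers by name is also an assumption, since fused Myelin nodes can be named differently in the two builds.

```python
import json
from typing import Dict, Iterable, List


def load_profile(path: str) -> List[dict]:
    """Load a profile JSON file (assumed to be a list of records)."""
    with open(path) as f:
        return json.load(f)


def layer_times(records: Iterable[dict]) -> Dict[str, float]:
    """Map layer name -> average time in ms.

    Records without both fields (for example a leading {"count": N}
    entry) are skipped. Field names are an assumption about the format.
    """
    times: Dict[str, float] = {}
    for rec in records:
        if "name" in rec and "averageMs" in rec:
            times[rec["name"]] = float(rec["averageMs"])
    return times


def compare_profiles(fp16_records, qdq_records, keyword: str = "reformat") -> dict:
    """Report layers added in the Q/DQ engine and shared layers that got slower."""
    fp16 = layer_times(fp16_records)
    qdq = layer_times(qdq_records)

    # Layers that appear only in the Q/DQ engine.
    added = sorted(name for name in qdq if name not in fp16)
    # Subset of added layers whose name mentions the keyword (case-insensitive).
    reformat_added = [name for name in added if keyword in name.lower()]

    # Layers present in both engines that take longer in the Q/DQ engine,
    # sorted by the size of the slowdown in ms.
    slower = sorted(
        ((name, qdq[name] - fp16[name])
         for name in qdq
         if name in fp16 and qdq[name] > fp16[name]),
        key=lambda item: item[1],
        reverse=True,
    )
    return {"added": added, "reformat_added": reformat_added, "slower": slower}
```

With two exported files (the file names here are placeholders), `compare_profiles(load_profile("fp16_profile.json"), load_profile("qdq_profile.json"))` returns the added layers and the slowdowns, so any extra reformat nodes and slower Myelin nodes show up in one place.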

If you only want the ResNet part to run in INT8 and the rest to run in FP16/FP32, you can split the model into two parts: backend + head.

lix19937, Jan 06 '25 06:01