TensorRT: inserting Q/DQ nodes severely degrades the performance of the unquantized Myelin part

Description

I am performing QAT quantization on a complex model. When I insert Q/DQ nodes into the ResNet portion that I want to quantize, following the Q/DQ placement rules, TensorRT runs that portion in INT8 after building. How can I ensure that the parts without Q/DQ nodes still run with optimal performance in non-INT8 precision (FP16 + FP32)? I noticed that after inserting Q/DQ nodes into one part of the network, the unquantized parts run slower than they do in a pure FP16 build.

I ran an experiment in which I inserted Q/DQ nodes only before a single convolution layer and looked at the resulting build.

The result of building the same network in FP16 mode.

Why does the part within the green box perform differently?

Another question: even though the inputs and outputs of the Myelin region are exactly the same in the two exported engines, its execution time differs significantly.

fp16 mode:

I'm confused about how I can ensure that the unquantized parts of my model run optimally in FP16 or FP32.

Environment

TensorRT Version: 8.5.2

NVIDIA GPU: orin / 3090

NVIDIA Driver Version:

CUDA Version: 11.4

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

Dec 23 '24 07:12 zsh4614

Can you upload the two logs, each captured with trtexec --verbose --dumpProfile --dumpLayerInfo --separateProfileRun 2>&1 | tee log ?
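To make the request concrete, here is a minimal sketch that spells out one full command line per model. The file names model_fp16.onnx and model_qdq.onnx are hypothetical, and the precision flags (--fp16 for the baseline, --fp16 plus --int8 for the Q/DQ model) are an assumption about how the two engines were built; only the profiling flags come from the comment above.

```python
import shlex

# Profiling flags taken from the request in this thread.
PROFILE_FLAGS = ["--verbose", "--dumpProfile", "--dumpLayerInfo", "--separateProfileRun"]


def trtexec_command(onnx_path, precision_flags, log_path):
    """Build a shell command line that runs trtexec and tees its output to a log file."""
    cmd = ["trtexec", f"--onnx={onnx_path}", *precision_flags, *PROFILE_FLAGS]
    return f"{shlex.join(cmd)} 2>&1 | tee {shlex.quote(log_path)}"


# Hypothetical file names; replace them with the real models.
print(trtexec_command("model_fp16.onnx", ["--fp16"], "log_fp16.log"))
# Assumption: the Q/DQ model is built with both INT8 and FP16 enabled.
print(trtexec_command("model_qdq.onnx", ["--fp16", "--int8"], "log_qdq.log"))
```

Keeping both runs identical apart from the model and precision flags makes the two per-layer profiles directly comparable.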

Dec 28 '24 15:12 lix19937

Before inserting Q/DQ: log_fp16.log

After inserting Q/DQ: log_qdq.log

Jan 02 '25 06:01 zsh4614

From your log, the QAT model has some additional reformat copy nodes.
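One way to check this yourself is to list the reformat layers that appear only in the Q/DQ log. The sketch below is a rough helper, not an official tool: it assumes the inserted layers are named with the phrase "Reformatting CopyNode", that each log line starts with bracketed prefixes such as a timestamp and "[I]", and that profile rows end with numeric timing columns. Adjust the patterns if your logs look different.

```python
import re

# Assumption: inserted reformat layers carry this phrase in their name.
REFORMAT_MARKER = "Reformatting CopyNode"
# Assumption: lines start with bracketed prefixes, e.g. "[timestamp] [I] ".
PREFIX = re.compile(r"^(\[[^\]]*\]\s*)+")
# Assumption: profile rows end with whitespace-separated numeric columns.
TRAILING_NUMBERS = re.compile(r"(\s+[\d.]+%?)+\s*$")


def reformat_layers(log_text):
    """Return the distinct reformat layer names mentioned in a trtexec log."""
    names = []
    for line in log_text.splitlines():
        if REFORMAT_MARKER not in line:
            continue
        body = PREFIX.sub("", line)
        body = TRAILING_NUMBERS.sub("", body).strip()
        if body not in names:
            names.append(body)
    return names


def extra_reformats(fp16_log_text, qdq_log_text):
    """Reformat layers present in the Q/DQ log but absent from the FP16 log."""
    baseline = set(reformat_layers(fp16_log_text))
    return [name for name in reformat_layers(qdq_log_text) if name not in baseline]
```

Typical use would be extra_reformats(open("log_fp16.log").read(), open("log_qdq.log").read()); each returned name points to a tensor where the Q/DQ build inserts a format or precision conversion that the FP16 build does not need.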

If you only want the ResNet part to run in INT8 and the rest to run in FP16/FP32, you can split the model into two parts: backend + head.
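A possible way to do the split is with onnx.utils.extract_model, cutting the graph at the tensors that leave the ResNet part. This is only a sketch: the tensor names ("images", "resnet_out", "detections") and the output file names are hypothetical, and it assumes the onnx Python package is installed and that the cut tensors fully separate the two parts.

```python
def split_plan(model_inputs, cut_tensors, model_outputs):
    """Describe the two subgraphs: inputs -> cut tensors, and cut tensors -> outputs."""
    return {
        "backend": {"input_names": list(model_inputs), "output_names": list(cut_tensors)},
        "head": {"input_names": list(cut_tensors), "output_names": list(model_outputs)},
    }


def split_model(onnx_path, plan):
    """Write backend.onnx and head.onnx next to the working directory."""
    import onnx.utils  # imported lazily so the planning helper works without onnx

    for part, names in plan.items():
        onnx.utils.extract_model(
            onnx_path, f"{part}.onnx", names["input_names"], names["output_names"]
        )


# Hypothetical tensor names; look up the real ones in your exported graph.
plan = split_plan(["images"], ["resnet_out"], ["detections"])
print(plan)
# split_model("model_qdq.onnx", plan)  # uncomment once the names match your model
```

Each part can then be built as its own engine, for example the Q/DQ backend with INT8 and FP16 enabled and the head with FP16 only, so the Q/DQ nodes no longer influence how the unquantized part is optimized. The cost is an extra engine boundary at the cut tensors.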

Jan 06 '25 06:01 lix19937