Explicit INT8 Quantization Fails to Fuse Concat-Conv Block Compared to Implicit Mode
Description
I am currently trying to quantize my UNet-based PyTorch model to INT8. I end up with a slightly different (and slower) engine graph compared to the one generated by implicit quantization.
More specifically, I’m struggling to properly quantize the Concat-Conv section of the network. I’ve tried all suggested Q/DQ placements from various issues (and more), but TensorRT never fully fuses the operators in that section. Since I can’t seem to find an official Q/DQ placement guide in the documentation anymore, I’m reaching out for your advice.
PS: I think it would be very helpful to include a Q/DQ placement guide in the official documentation again.
Below, I’ll show a minimal example of the ONNX and TensorRT engine graphs I experimented with.
This is the fp32 ONNX model and the TRT engine generated by implicit quantization:
When I use the basic configuration of the pytorch_quantization toolkit, which means I don't quantize the concat (as proposed in this issue https://github.com/NVIDIA/TensorRT/issues/3861), I get these results:
The second thing I tried was adding Q/DQ nodes to both inputs of the concat (as proposed in this issue https://github.com/NVIDIA/TensorRT/issues/1144); this is the result:
Then I only quantized the residual path, which is my best attempt so far; however, there are still unfused scale/pointwise operators:
If you have any idea how to correctly place the Q/DQ nodes in this scenario to match implicit quantization behavior, I’d really appreciate your insights. Also, if there’s a hidden or updated Q/DQ placement guide somewhere, I’d be very grateful if you could point me to it. Thank you so much for your help and time!
@ttyio tagging you since you’ve provided valuable input on many INT8 quantization issues :)
Environment
TensorRT Version: 10.8.0.43
NVIDIA GPU: GeForce RTX 4090
NVIDIA Driver Version: 550.120
CUDA Version: 12.8.0.38
CUDNN Version: 9.7.1.26
hi there~ have you tried manually setting the scale values of the two Q/DQ nodes before the concat to the same value? I just found that if I set the Q/DQ nodes at the end of the different paths into the concat to the same scale, TRT removes the scale op from the engine graph. I'm not sure whether it also handles the pointwise op, though.
Hey @OminousBlackCat – thanks a lot for your idea!
I’m not sure I fully understood your suggestion. Are you proposing to add Q/DQ nodes to both input paths of the concat operation, with both Q/DQ pairs sharing the same scale? In the Python code, would that mean both paths use the same TensorQuantizer instance?
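If I understand the shared-scale idea correctly, it can be checked numerically with plain PyTorch fake quantization (no pytorch_quantization dependency, all names below are illustrative): when both concat inputs use the same scale, quantizing each input and then concatenating gives exactly the same tensor as concatenating first and quantizing once, which is presumably the property that lets TensorRT fold the per-input rescales away. Sharing one TensorQuantizer instance across both paths would guarantee one scale.

```python
import torch

def qdq(t: torch.Tensor, scale: float) -> torch.Tensor:
    # Symmetric int8 fake quantization: Q immediately followed by DQ.
    return torch.fake_quantize_per_tensor_affine(t, scale, 0, -128, 127)

x = torch.randn(1, 8, 4, 4)
skip = torch.randn(1, 8, 4, 4)
scale = 0.02  # arbitrary shared scale for the demo

# Q/DQ on each input with the SAME scale ...
per_input = torch.cat([qdq(x, scale), qdq(skip, scale)], dim=1)

# ... equals a single Q/DQ after the concat, element for element,
# because fake quantization acts independently on each element.
after_cat = qdq(torch.cat([x, skip], dim=1), scale)

assert torch.equal(per_input, after_cat)
```

With two different scales this equality breaks, so the concat would genuinely need a rescale on one of its inputs, which may be where the leftover scale op in my engine graph comes from.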
To give you more context, here is my current PyTorch model implementation: 👉 https://gist.github.com/patrickgrommelt/53c18653cde1d1dc7fd2684dceeae6db
Let me know if that aligns with what you had in mind!