Explicit quantization is slower than implicit quantization and produces invalid results
Description
Since implicit quantization is deprecated, I started migrating my model pipeline to explicit quantization. However, I ran into several issues:
- Different behaviour with concat:
With implicit quantization the graph looks like this:
A(fp16:linear) -> Concat
B(fp16:linear) -> Concat
C(fp16:linear) -> Concat -> Quantize+Reformat -> Conv
Basically, the Concat is replaced with a plain copy, since all of its inputs are aligned.
However, when I use explicit quantization the graph becomes like this:
A(fp16:linear) -> Quantize -> Concat
B(fp16:linear) -> Quantize -> Concat
C(fp16:linear) -> Quantize -> Concat -> Reformat -> Conv
TRT swapped the order of Quantize and Concat, which results in a suboptimal graph that is ~30% slower. No matter what I tried, I could not reproduce the implicit-quantization plan with the explicitly quantized model.
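To double-check the transformation itself, I modeled it in pure Python (toy values, no TensorRT): with one scale shared by all Concat inputs, quantize-then-concat and concat-then-quantize are element-for-element identical, which is exactly what should let the Concat degenerate into a copy.

```python
def quantize(xs, scale):
    """Symmetric int8 quantization: round(x / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(x / scale))) for x in xs]

a = [0.5, -1.0, 2.0]   # toy stand-ins for tensors A, B, C feeding the Concat
b = [1.5, 0.25]
c = [-2.0, 3.0]
scale = 0.1            # one scale shared by all three inputs

# Implicit-quantization layout: Concat first, then a single Quantize.
concat_then_q = quantize(a + b + c, scale)

# Explicit-quantization layout: Quantize each input, then Concat.
q_then_concat = quantize(a, scale) + quantize(b, scale) + quantize(c, scale)

# With one shared scale the two orders are element-for-element identical,
# which is what allows the Concat to degenerate into a plain copy.
assert concat_then_q == q_then_concat
print(concat_then_q)  # [5, -10, 20, 15, 2, -20, 30]
```

So the reordering itself is value-preserving; the ~30% gap seems to come purely from the kernels TRT schedules around the reordered graph.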
- Q/DQ placement with ConvTranspose:
With implicit quantization, TRT is able to fuse ConvTranspose and the activation, and according to all recommendations, Q/DQ nodes should be placed like this:
input -> Q -> DQ -> ConvTranspose -> Activation -> Q -> DQ -> output
However, with this placement TRT fails to fuse ConvTranspose and the activation, which results in an invalid output. I am forced to place the nodes like this:
input -> Q -> DQ -> ConvTranspose -> Q -> DQ -> Activation -> Q -> DQ -> output
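Note that the two placements cannot be expected to match bit-for-bit, which is easy to reproduce outside TRT. A pure-Python sketch (sigmoid as a stand-in activation and toy values; the details of my model don't matter here):

```python
import math

def fake_quant(x, scale):
    """One Q -> DQ pair: quantize to int8, clamp, dequantize."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale

def sigmoid(x):  # stand-in activation, chosen only for illustration
    return 1.0 / (1.0 + math.exp(-x))

x = 0.369      # toy ConvTranspose output value
scale = 0.02   # toy quantization scale

# Recommended placement: ConvTranspose -> Activation -> Q -> DQ
fused = fake_quant(sigmoid(x), scale)

# Workaround placement: ConvTranspose -> Q -> DQ -> Activation -> Q -> DQ
workaround = fake_quant(sigmoid(fake_quant(x, scale)), scale)

# The extra Q/DQ pair adds one more rounding step, so the two graphs are
# not numerically equivalent -- here the results land in different buckets.
print(fused, workaround)
assert fused != workaround
```

So small differences between the two placements are expected; the actual bug is that the recommended placement produces an invalid output, not merely a slightly different one.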
- Explicitly quantized convolutions are slower than implicitly quantized ones
I consistently get slower profiling results with the explicitly quantized model (~5%), and it seems to mostly come down to tactic selection. Algorithm selectors are deprecated, and I cannot figure out how to use the editable timing cache for CaskConvolution nodes, because there are no cache keys at all in the verbose logs.
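For completeness, this is roughly how I am trying to set up the cache. This is a builder-config sketch, not a verified workflow: the flag and method names are taken from the TensorRT 10 Python API, so please treat them as assumptions to check against your exact version.

```python
import tensorrt as trt  # TensorRT 10.x assumed

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Opt in to the editable timing cache (assumption: this TRT 10 flag is
# what exposes per-layer cache entries for inspection/editing).
config.set_flag(trt.BuilderFlag.EDITABLE_TIMING_CACHE)

# Reuse a previously serialized cache if one exists, else start empty.
try:
    with open("timing.cache", "rb") as f:
        cache = config.create_timing_cache(f.read())
except FileNotFoundError:
    cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

# ... parse the network and build the engine here ...

# Persist the (possibly updated) cache for the next build.
with open("timing.cache", "wb") as f:
    f.write(config.get_timing_cache().serialize())
```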
Additional issue: since my network uses FP16 inputs, I expect TRT to consume them directly without any reformats. However, without the DIRECT_IO flag, TRT always converts FP16 to FP32 first and then back to FP16. DIRECT_IO is deprecated; what should I use as an alternative?
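My current understanding is that the replacement is to pin the dtype and allowed formats on the network's I/O tensors themselves. Is something like this the intended way? (A sketch only; single input/output indexing is assumed for illustration.)

```python
import tensorrt as trt  # TensorRT 10.x assumed

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate the network (e.g. via the ONNX parser) ...

# Pin the first input and output to FP16 linear so no reformat is needed
# (assumes a single-input, single-output network for illustration).
for tensor in (network.get_input(0), network.get_output(0)):
    tensor.dtype = trt.float16
    tensor.allowed_formats = 1 << int(trt.TensorFormat.LINEAR)
```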
Environment
TensorRT Version: 10.8.0.43
NVIDIA GPU: RTX 3060 LT
NVIDIA Driver Version: 572.47
CUDA Version: 12.8.0
CUDNN Version: 9.7.1.26
Operating System: Windows 11
Relevant Files
Hey! Were you able to solve your issue? I just opened an issue with a similar problem (Q/DQ placement with Concat and ConvTranspose):
https://github.com/NVIDIA/TensorRT/issues/4401
@patrickgrommelt No. I decided to stay with implicit quantization, since it is more stable for my model and still works. Hopefully, by the time NVIDIA removes it in a later version, they will have fixed these issues. Otherwise, I plan on including the lean runtime for inference.