INT8 quantization for HiFi-GAN vocoder -- performance issue
Hello!
My goal is to run a TRT engine for HiFi-GAN on H100 as efficiently as possible, with little or no drop in quality. I have decided that the simplest path to achieve this would be as follows (a rough sketch of the flow follows the list):
- Quantize the model to int8 via the TensorRT-Model-Optimizer (modelopt) library.
- Finetune the quantized model (as per the instructions).
- Convert the quantized model to a TRT engine using the --stronglyTyped option of trtexec, so that no unpredictable additional quantization is done by trtexec.
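For context, the modelopt flow I am using on the PyTorch side looks roughly like the sketch below. The model loader, calibration data, and input shape are placeholders rather than my exact code, so treat it as an outline of steps 1 and 2, not the real script.

```python
import torch
import modelopt.torch.quantization as mtq

# Placeholders (not the real code): `load_hifigan_generator` is a hypothetical
# loader returning the HiFi-GAN generator as a torch.nn.Module, and
# `calib_batches` stands in for real mel-spectrogram calibration batches.
generator = load_hifigan_generator().eval().cuda()
calib_batches = [torch.randn(1, 80, 256).cuda() for _ in range(32)]

def forward_loop(model):
    # Run calibration data through the model so activation ranges are collected.
    with torch.no_grad():
        for mel in calib_batches:
            model(mel)

# Insert fake-quant (Q/DQ) modules and calibrate with the default int8 config.
generator = mtq.quantize(generator, mtq.INT8_DEFAULT_CFG, forward_loop)

# ... QAT finetuning with the usual HiFi-GAN losses goes here ...

# Export with Q/DQ nodes, then build with e.g.:
#   trtexec --onnx=hifigan_int8.onnx --stronglyTyped --saveEngine=hifigan_int8.plan
torch.onnx.export(generator, torch.randn(1, 80, 256).cuda(),
                  "hifigan_int8.onnx", opset_version=17)
```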
If I skip the first two steps and quantize the model using --int8 flag of trtexec, the model runs 2 times faster. In contrast, after quantization with modelopt library, there is no speedup at all for the resulting trt engine. Please see the detailed times in the corresponding issue at the modelopt repo.
Could you please give me some recommendations regarding quantization of the model?
- Is there any other way of finetuning the int8 version of the model before deploying it as a TRT engine?
- What would be the proper Q/DQ placement in the model to achieve a speedup after quantization?
- If I convert the model with trtexec right away, the difference between the --fp16 and --int8 flags is small. Shouldn't --fp16 give a 2x speedup in theory, and --int8 a 4x one?
Here is the link to the unquantized onnx model: https://drive.google.com/file/d/1qVotIH-0K73rXUlZ6yoPgIvOjlEOnvGY/view?usp=sharing
Maybe @nzmora-nvidia or @galagam ?
Thanks @yuanyao-nv . This issue should be filed under ModelOpt - https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues
@galagam
Please read the ModelOpt issue I mentioned: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/80#issuecomment-2832485911
One of the developers, @i-riyad, asked me to post the issue here.
Moreover, I would like to hear your opinion on question 1 from my original post: is there any way to finetune the model for int8 quantization without modelopt?
@WhiteTeaDragon sorry, I missed that. We'll discuss this internally and get back to you.
ModelOpt is the recommended approach. However, any other method that produces a valid ONNX file with Q/DQ nodes may be used. ONNXRuntime has quantization tools; you may also write a custom script to add Q/DQ nodes.
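For example, a minimal ONNXRuntime static-quantization sketch that emits Q/DQ nodes could look like the following. The input name, shape, and random calibration data are assumptions; use real mel batches and the actual input name from your model.

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class MelReader(CalibrationDataReader):
    """Feeds a handful of calibration batches.

    The input name "mel" and the (1, 80, 256) shape are assumptions; replace
    random data with real mel spectrograms for meaningful activation ranges.
    """
    def __init__(self, num_batches=32):
        self.batches = iter(
            [{"mel": np.random.randn(1, 80, 256).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        return next(self.batches, None)

quantize_static(
    model_input="hifigan.onnx",
    model_output="hifigan_qdq.onnx",
    calibration_data_reader=MelReader(),
    quant_format=QuantFormat.QDQ,      # explicit Q/DQ nodes, as TensorRT expects
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```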
@WhiteTeaDragon could you please modify your model so that the padding becomes an attribute of the following Conv op? Fusing the Pad op into the Conv op is beneficial for performance.
Besides, you can use the onnx-graphsurgeon tool to modify the onnx model. Please refer to https://github.com/NVIDIA/TensorRT/blob/main/tools/onnx-graphsurgeon/examples/04_modifying_a_model/modify.py for details.
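As a starting point, a graphsurgeon sketch along these lines can fold a constant zero-padding Pad into the following Conv's pads attribute. It assumes the Pad's pads input is a constant, each Pad feeds exactly one Conv, and the convolutions are 1D; reflection or edge padding cannot be folded this way, and the file names are placeholders.

```python
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("hifigan.onnx"))  # placeholder file name

for node in graph.nodes:
    if node.op != "Pad":
        continue
    consumers = node.outputs[0].outputs
    # Only handle the simple case: a constant-pads Pad feeding exactly one Conv.
    if len(consumers) != 1 or consumers[0].op != "Conv":
        continue
    if len(node.inputs) < 2 or not isinstance(node.inputs[1], gs.Constant):
        continue
    conv = consumers[0]
    pads = node.inputs[1].values.tolist()
    # ONNX Pad layout for a (N, C, L) input: [N_b, C_b, L_b, N_e, C_e, L_e];
    # the Conv pads attribute covers only the spatial axis: [L_begin, L_end].
    spatial_begin, spatial_end = int(pads[2]), int(pads[-1])
    existing = conv.attrs.get("pads", [0, 0])
    conv.attrs["pads"] = [existing[0] + spatial_begin, existing[1] + spatial_end]
    # Rewire the Conv to read the Pad's input directly and orphan the Pad node.
    conv.inputs[0] = node.inputs[0]
    node.outputs.clear()

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "hifigan_pad_fused.onnx")
```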
I think it's more beneficial to keep this discussion in one place - https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/80. Using ModelOpt for quantization is the recommended approach. Any updates/fixes for the quantization method should go into ModelOpt.