
INT8 quantization for HiFi-GAN vocoder -- performance issue

Open WhiteTeaDragon opened this issue 9 months ago • 7 comments

Hello!

My goal is to run a TensorRT engine for HiFi-GAN on an H100 as efficiently as possible, with little or no drop in quality. I decided that the simplest path to achieve this would be as follows:

  1. Quantize the model to int8 via the TensorRT-Model-Optimizer (ModelOpt) library.
  2. Finetune the quantized model (as per the instructions).
  3. Convert the quantized model into a TensorRT engine using the --stronglyTyped option of trtexec, so that trtexec does not perform any unpredictable additional quantization. (A rough sketch of this pipeline is shown after the list.)
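
Here is roughly what I am doing for steps 1-3 (a simplified sketch; `generator`, `calib_loader`, and `dummy_mel` stand in for the actual HiFi-GAN generator module, calibration data loader, and sample input):

```python
import torch
import modelopt.torch.quantization as mtq

# Placeholders: `generator` is the HiFi-GAN generator (torch.nn.Module),
# `calib_loader` yields mel-spectrogram batches, `dummy_mel` is a sample input.

def forward_loop(model):
    # Run a few calibration batches so ModelOpt can collect activation ranges.
    for mel in calib_loader:
        model(mel.cuda())

# Step 1: insert fake-quant (Q/DQ) modules and calibrate int8 scales.
generator = mtq.quantize(generator, mtq.INT8_DEFAULT_CFG, forward_loop)

# Step 2: quantization-aware finetuning, i.e. a regular training loop over
# the quantized generator with a reduced learning rate (omitted here).

# Step 3: export with explicit Q/DQ nodes, then build the engine with
#   trtexec --onnx=hifigan_int8.onnx --stronglyTyped
torch.onnx.export(generator, dummy_mel.cuda(), "hifigan_int8.onnx", opset_version=17)
```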

If I skip the first two steps and quantize the model with the --int8 flag of trtexec, the resulting engine runs about 2x faster. In contrast, after quantization with the ModelOpt library, there is no speedup at all for the resulting TensorRT engine. Please see the detailed timings in the corresponding issue in the ModelOpt repo.

Could you please give me some recommendations regarding quantization of the model?

  1. Is there any other way to finetune the int8 version of the model before deploying it as a TensorRT engine?
  2. What would be the proper Q/DQ placement in the model to achieve a speedup after quantization?
  3. If I convert the model with trtexec right away, the difference between the --fp16 and --int8 flags is small. Shouldn't --fp16 in theory give a 2x speedup, and --int8 a 4x speedup?

Here is the link to the unquantized onnx model: https://drive.google.com/file/d/1qVotIH-0K73rXUlZ6yoPgIvOjlEOnvGY/view?usp=sharing

WhiteTeaDragon avatar Apr 30 '25 08:04 WhiteTeaDragon

Maybe @nzmora-nvidia or @galagam ?

yuanyao-nv avatar May 02 '25 21:05 yuanyao-nv

Thanks @yuanyao-nv. This issue should be filed under ModelOpt: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues

galagam avatar May 05 '25 01:05 galagam

@galagam

Please read the ModelOpt issue I mentioned: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/80#issuecomment-2832485911

One of the developers, @i-riyad, asked me to post the issue here.

WhiteTeaDragon avatar May 05 '25 07:05 WhiteTeaDragon

Moreover, I would like to hear your opinion on question 1 from my original post: is there any other way to finetune the model for int8 quantization, i.e. without ModelOpt?

WhiteTeaDragon avatar May 05 '25 13:05 WhiteTeaDragon

@WhiteTeaDragon sorry, I missed that. We'll discuss this internally and get back to you.

ModelOpt is the recommended approach. However, any other method that produces a valid ONNX file with Q/DQ nodes may be used. ONNX Runtime has quantization tools; you can also write a custom script to insert Q/DQ nodes. A rough sketch of the ONNX Runtime path is shown below.
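
For illustration only, a minimal sketch of the ONNX Runtime path (the input name, shapes, and file names are assumptions and need to be adapted to the actual model):

```python
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class MelReader(CalibrationDataReader):
    """Feeds a few mel-spectrogram batches for calibration."""
    def __init__(self, mels, input_name="mel"):  # input name is an assumption
        self.batches = iter([{input_name: m} for m in mels])
    def get_next(self):
        return next(self.batches, None)

# Placeholder calibration data; use real mel spectrograms in practice.
mels = [np.random.randn(1, 80, 256).astype(np.float32) for _ in range(16)]

quantize_static(
    "hifigan.onnx",
    "hifigan_int8_qdq.onnx",
    MelReader(mels),
    quant_format=QuantFormat.QDQ,      # emit explicit Q/DQ nodes for TensorRT
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```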

galagam avatar May 05 '25 14:05 galagam

[Screenshot: ONNX graph excerpt showing a standalone Pad node feeding the following Conv node]

Could you please modify your model so that the padding becomes an attribute of the following Conv op, @WhiteTeaDragon? When the standalone Pad op is folded into the Conv op, it benefits performance.

Besides, you can use the onnx-graphsurgeon tool to modify the ONNX model. Please refer to https://github.com/NVIDIA/TensorRT/blob/main/tools/onnx-graphsurgeon/examples/04_modifying_a_model/modify.py for details. A rough sketch of such a pass is shown below.
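
As an illustration, a rough sketch of such a graph-surgery pass (assuming the Pad's pad amounts are a constant, the mode is constant/zero padding, and only spatial dims are padded; the file names are placeholders):

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("hifigan.onnx"))  # file name is a placeholder

for conv in [n for n in graph.nodes if n.op == "Conv"]:
    producers = conv.inputs[0].inputs
    if not producers or producers[0].op != "Pad":
        continue
    pad = producers[0]
    # Only constant (zero) padding with a constant 'pads' input can be folded.
    if pad.attrs.get("mode", "constant") != "constant":
        continue
    if not isinstance(pad.inputs[1], gs.Constant):
        continue
    pad_vals = pad.inputs[1].values.astype(np.int64)
    rank = len(pad_vals) // 2
    begins, ends = pad_vals[:rank], pad_vals[rank:]
    # Bail out if batch or channel dims are padded; Conv pads cover spatial dims only.
    if begins[:2].any() or ends[:2].any():
        continue
    spatial = rank - 2
    conv_pads = np.array(conv.attrs.get("pads", [0] * 2 * spatial), dtype=np.int64)
    conv_pads[:spatial] += begins[2:]
    conv_pads[spatial:] += ends[2:]
    conv.attrs["pads"] = [int(p) for p in conv_pads]
    # Bypass the Pad node so cleanup() removes it.
    conv.inputs[0] = pad.inputs[0]

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "hifigan_pad_fused.onnx")
```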

kris1025 avatar Jun 10 '25 03:06 kris1025

I think it's more beneficial to keep this discussion in one place - https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/80. Using ModelOpt for quantization is the recommended approach. Any updates/fixes for the quantization method should go into ModelOpt.

galagam avatar Jun 10 '25 06:06 galagam