
paraformer onnx-gpu to tensorrt conversion fails (Could not find any implementation for node)

willnufe opened this issue 1 year ago • 3 comments

1. environment

  • OS (e.g., Linux): Linux
  • FunASR Version (e.g., 1.0.0): 1.1.3
  • ModelScope Version (e.g., 1.11.0): 1.11.0
  • GPU (e.g., V100M32): A100

1.1 pt to onnx (the predictor's CIF module uses cif_v1):

  • onnx-simplifier: 0.4.36
  • PyTorch Version (e.g., 2.0.0): 2.0.1
  • How you installed funasr (pip, source): pip
  • Python version: 3.9.18
  • CUDA/cuDNN version (e.g., cuda11.7): 11.7

1.2 onnx to tensorrt:

  • tensorrt(trtexec): 8.6.1.6
  • CUDA/cuDNN version (e.g., cuda11.7): 11.3

2. problem

Converting the paraformer onnx-gpu model with the command below fails:

trtexec \
    --onnx=/raid/t3cv/wangch/WORK_SAPCE/ASR/models/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/model_sim.onnx \
    --saveEngine=/raid/t3cv/wangch/WORK_SAPCE/TEMP/work_space/onnx2tensorrt/models/model.engine \
    --minShapes=speech:1x1000x560,speech_lengths:1 \
    --optShapes=speech:16x1000x560,speech_lengths:16 \
    --maxShapes=speech:16x1000x560,speech_lengths:16 \
    --workspace=24576 \
    --verbose --fp16 --device=7

The main error is:

Error[10]: Could not find any implementation for node 
{ForeignNode[(Unnamed Layer* 6555) [Constant] + (Unnamed Layer* 6556) [Shuffle].../decoder/decoders/decoders.0/self_attn/Transpose + (Unnamed Layer* 7213) [Shuffle]]}.


willnufe avatar Jul 24 '24 07:07 willnufe

@willnufe Some modifications to the code are needed to support this. I don't have time recently, but if you are willing to do it, I can give you some suggestions offline.

yuekaizhang avatar Jul 29 '24 02:07 yuekaizhang

> @willnufe Some modifications to the code are needed to support this. I don't have time recently, but if you are willing to do it, I can give you some suggestions offline.

Thank you very much. I would like to make an attempt. Please give me some suggestions.

willnufe avatar Jul 29 '24 03:07 willnufe

@willnufe To get the maximum throughput, I think we first need to make the ONNX fp16 Paraformer work.

https://github.com/modelscope/FunASR/commit/9a9b474e7de7cc90d2ee124dc8d6c2cfa887c059. This commit used several registered hooks to rescale the TorchScript fp32 model to a TorchScript fp16 model. The first step is to follow it to calibrate the ONNX fp32 model.
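The hook-based calibration idea can be sketched in plain Python. This is a toy stand-in, not FunASR's actual code: the `Layer` class, the `record` hook, and the 0.5 safety margin are all illustrative assumptions; the real commit registers forward hooks on PyTorch modules.

```python
# Toy sketch of hook-based fp16 rescaling calibration (illustrative only;
# the referenced commit uses PyTorch register_forward_hook on FunASR modules).
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

class Layer:
    """Minimal stand-in for a torch.nn.Module that supports forward hooks."""
    def __init__(self, name, gain):
        self.name, self.gain, self.hooks = name, gain, []

    def register_forward_hook(self, fn):
        self.hooks.append(fn)

    def __call__(self, x):
        out = x * self.gain          # pretend computation
        for fn in self.hooks:
            fn(self, x, out)         # (module, input, output), as in PyTorch
        return out

def calibrate(layers, samples, margin=0.5):
    """Run calibration data through the layers, record per-layer activation
    peaks via hooks, and return rescale factors for layers that would
    overflow the fp16 safe range."""
    observed = {}

    def record(module, inp, out):
        observed[module.name] = max(observed.get(module.name, 0.0), abs(out))

    for layer in layers:
        layer.register_forward_hook(record)
    for x in samples:
        for layer in layers:
            x = layer(x)
    return {name: FP16_MAX * margin / peak
            for name, peak in observed.items() if peak > FP16_MAX * margin}

layers = [Layer("encoder", 10.0), Layer("decoder", 50.0)]
scales = calibrate(layers, samples=[100.0, 200.0])
print(scales)  # only layers whose peaks exceed the fp16 safe range appear
```

The key design point the commit exploits is that hooks observe activations without touching the model code itself, so the calibration pass can be run on an unmodified fp32 model.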

With ONNX fp16, you could expect about a 50% throughput improvement compared with the ONNX fp32 pipeline. Then let's work on the TensorRT export.
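As a quick illustration of why the rescaling matters: IEEE 754 half precision tops out at 65504, so an unscaled fp32 activation peak can overflow fp16. A standalone demo using Python's `struct` half-precision format code (the activation value and the 2x safety margin are made-up numbers):

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_fp16(x):
    """Return True if x can be packed into IEEE 754 half precision."""
    try:
        struct.pack('e', x)   # 'e' is the half-precision format code
        return True
    except (OverflowError, struct.error):
        return False

activation = 3.0e5                      # made-up fp32 activation peak
scale = FP16_MAX * 0.5 / activation     # rescale with a 2x safety margin
print(fits_fp16(activation))            # False: overflows fp16
print(fits_fp16(activation * scale))    # True: fits after rescaling
```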

Would you mind adding my wechat ykzhang2020?

yuekaizhang avatar Jul 29 '24 03:07 yuekaizhang