funasr + Whisper-large-v3 (multilingual speech recognition) + fsmn-vad + ct-punc-c + cam++ raises NotImplementedError: batch decoding is not implemented
Notice: In order to resolve issues more efficiently, please raise issues following the template.
❓ Questions and Help
Using funasr to load the Whisper-large-v3 multilingual model together with fsmn-vad, ct-punc-c, and cam++ fails with the error below:

funasr version: 1.1.16. Check update of funasr, and it would cost few times. You may disable it by set disable_update=True in AutoModel
You are using the latest version of funasr-1.1.16
Detect model requirements, begin to install it: /data/lproot/dl/speechRec/models/whisper-more-language/requirements.txt
install model requirements successfully
Detect model requirements, begin to install it: /data/lproot/dl/speechRec/models/spk/requirements.txt
install model requirements successfully
rtf_avg: 0.038: 100%|████████| 1/1 [00:01<00:00, 1.15s/it]
Traceback (most recent call last):
  File "/data/lproot/dl/speechRec/core/paraformer/main.py", line 42, in <module>
    res = model.generate(input=path,
  File "/home/lproot/.conda/envs/whisper/lib/python3.10/site-packages/funasr/auto/auto_model.py", line 304, in generate
    return self.inference_with_vad(input, input_len=input_len, **cfg)
  File "/home/lproot/.conda/envs/whisper/lib/python3.10/site-packages/funasr/auto/auto_model.py", line 458, in inference_with_vad
    results = self.inference(
  File "/home/lproot/.conda/envs/whisper/lib/python3.10/site-packages/funasr/auto/auto_model.py", line 343, in inference
    res = model.inference(**batch, **kwargs)
  File "/home/lproot/.conda/envs/whisper/lib/python3.10/site-packages/funasr/models/whisper/model.py", line 66, in inference
    raise NotImplementedError("batch decoding is not implemented")
NotImplementedError: batch decoding is not implemented
What is your question?
Code
from funasr import AutoModel

# paraformer_zh_path = "/models/Paraformer"
paraformer_zh_path = "models/whisper-more-language"
vad_model_path = "/models/Fsmn"
punc_model_path = "/models/punc_ct"
spk_model_path = "/models/spk"

# Load the models offline
model = AutoModel(
    model=paraformer_zh_path,
    vad_model=vad_model_path,
    punc_model=punc_model_path,
    spk_model=spk_model_path,
    # device="cuda:0",
)

path = "新录音1.m4a"
DecodingOptions = {
    "task": "transcribe",
    "language": None,
    "beam_size": None,
    "fp16": True,
    "without_timestamps": False,
    "prompt": None,
}
res = model.generate(
    input=path,
    DecodingOptions=DecodingOptions,
    batch_size_s=300,
    hotword='C16',
)
print(res)
What have you tried?
What's your environment?
- OS (e.g., Linux): CentOS 7.9
- FunASR Version (e.g., 1.0.0): 1.1.16
- ModelScope Version (e.g., 1.11.0): 1.20.1
- PyTorch Version (e.g., 2.0.0): torch 2.3.1, torchaudio 2.3.1, pytorch-lightning 2.4.0, pytorch-metric-learning 2.6.0, pytorch-wpe 0.0.1, torch-audiomentations 0.11.1, torch-complex 0.4.4, torch-pitch-shift 1.2.4, torchmetrics 1.4.1
- How you installed funasr (pip, source): pip
- Python version: 3.10
- GPU (e.g., V100M32): A40
- CUDA/cuDNN version (e.g., cuda11.7): 11.8 / 8.6.0
- Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1):
- Any other relevant information:
I'm running into this too. Have you solved it yet?
I hit the same problem. Frustrating, since the official FunASR docs say Whisper-large-v3 is supported.
Same error here. Is there a parameter to disable batch mode?
I got it running after patching it with Cursor; below is Cursor's summary of the changes.
1. Why this problem occurs
The root cause is a limitation of the Whisper model implementation in the FunASR library:
- Batch decoding is not implemented: in FunASR's WhisperWarp class, the inference method explicitly checks the batch size and raises "batch decoding is not implemented" whenever it is greater than 1, because the Whisper implementation does not support batched decoding (see the snippet after this list).
- VAD segmentation: for long audio, FunASR first splits the input into segments with the VAD (voice activity detection) model and then feeds the segments to the ASR model in batches. This is efficient for other models (such as SenseVoice and Paraformer), but the Whisper model does not support it.
- Unsuitable default parameters: the default batch_size_s is 60 seconds, which yields a very large batch size (about 60000 after the internal conversion to milliseconds), whereas the Whisper model is better suited to shorter audio segments.
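For reference, the check that raises this error looks roughly like the following (paraphrased from the traceback; the exact code in funasr/models/whisper/model.py may differ between funasr versions):

# Inside WhisperWarp.inference (funasr/models/whisper/model.py, around line 66):
if kwargs.get("batch_size", 1) > 1:
    raise NotImplementedError("batch decoding is not implemented")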
2. Changes made to fix it
To solve the problem, I implemented the following optimizations (a sketch of the core patch appears after this list):
- Whisper patch: monkey-patched the inference method of the WhisperWarp class so that when batch_size > 1 it no longer raises an error but instead processes each sample individually and merges the results.
- Capped batch size: added logic in the patch to lower an excessively large batch_size (> 100) to a reasonable value (5), and set a smaller batch_size_s for the Whisper model in the ASR engine (5 seconds instead of the default 60).
- Tuned VAD segmentation: reduced the VAD maximum segment length for Whisper from the default 30000 ms (30 s) to 5000 ms (5 s), so the Whisper model processes shorter audio segments more efficiently.
- Patched AutoModel: modified AutoModel's generate method to adjust the batching parameters automatically whenever a Whisper model is detected, ensuring Whisper uses suitable settings throughout the pipeline.
- Optimized parameter passing: added support for a vad_max_segment_length parameter in the run_model method and improved the parameter-passing logic to make configuration more flexible.
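Here is a minimal, untested sketch of the monkey patch described above. It is based only on the module path visible in the traceback (funasr/models/whisper/model.py); the WhisperWarp.inference signature and its (results, meta) return convention are assumptions, so verify both against your installed funasr version:

import funasr.models.whisper.model as whisper_model

_orig_inference = whisper_model.WhisperWarp.inference

def _sequential_inference(self, data_in, *args, **kwargs):
    # Force per-sample decoding so the batch_size > 1 guard never fires.
    kwargs["batch_size"] = 1
    if isinstance(data_in, (list, tuple)) and len(data_in) > 1:
        keys = kwargs.pop("key", None) or [None] * len(data_in)
        merged = []
        for sample, key in zip(data_in, keys):
            out = _orig_inference(self, [sample], *args, key=[key], **kwargs)
            # funasr models usually return (results, meta_data); keep the results.
            merged.extend(out[0] if isinstance(out, tuple) else out)
        return merged, {}
    return _orig_inference(self, data_in, *args, **kwargs)

whisper_model.WhisperWarp.inference = _sequential_inference

Combined with a shorter VAD segment length, e.g. vad_kwargs={"max_single_segment_time": 5000} when constructing AutoModel, this lets the VAD pipeline run while Whisper decodes each segment on its own.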
You can fix this error by setting batch_size=1, like:

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

MODEL_ROOT_DIR = ""
asr_model = AutoModel(
    model=MODEL_ROOT_DIR + "Whisper-large-v3",
    vad_model=MODEL_ROOT_DIR + "speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_kwargs={"max_single_segment_time": 30000},
    punc_model=MODEL_ROOT_DIR + "punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    device='cuda',
    disable_update=True,
)
res = asr_model.generate(
    input=output_wav,  # path to the audio file to transcribe
    cache={},
    language="en",
    use_itn=True,
    batch_size=1,  # HERE
    batch_size_s=300,
    batch_size_threshold_s=60,
    merge_vad=True,
    merge_length_s=35,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
Since the VAD crops the audio into segments that are then batched, but the ASR model cannot decode in batches. (I guess?)
Set the batch_size_s parameter to 0 or None.
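A quick sketch of that suggestion (untested; whether 0 or None actually disables batching may depend on your funasr version):

res = model.generate(input=path, batch_size_s=0)  # or batch_size_s=None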