
Problems using vad_model, punc_model and spk_model with streaming audio. How can these three models be loaded correctly for streaming?

Open 1113200320 opened this issue 1 year ago • 6 comments

Following README.md, I tried loading multiple models for streaming processing and ran into problems.

keyword: fsmn-vad, ct-punc, cam++, is_final

What have you tried?

step1: Copied the code from the "Speech Recognition (Streaming)" section of README.md, which uses model = AutoModel(model="paraformer-zh-streaming", ...). It runs fine. (The full code, identical to README.md, is pasted in the last section for easy reading.)

step2: Changed the model in the example code to:

model = AutoModel(model="paraformer-zh-streaming",
                  vad_model="fsmn-vad",  
                  punc_model="ct-punc", 
                  spk_model="cam++",
                  )
...
model.generate(input=speech_chunk, cache=cache, is_final=is_final) # keep only these 3 parameters

Result:

All returns are empty, e.g. [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '', 'timestamp': []}]
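
One thing worth trying for the VAD part specifically: the FunASR README also documents fsmn-vad in its own streaming mode, driven chunk by chunk with a cache, separately from the ASR model. The sketch below follows that pattern; the helper names (`vad_chunk_stride`, `stream_vad`) are my own, not FunASR API, and the model is downloaded on first use.

```python
# Hedged sketch: run fsmn-vad standalone in streaming mode (pattern from the
# FunASR README) instead of passing vad_model into the streaming AutoModel.

def vad_chunk_stride(chunk_size_ms, sample_rate):
    # Samples per VAD chunk, e.g. 200 ms at 16 kHz -> 3200 samples.
    return int(chunk_size_ms * sample_rate / 1000)

def stream_vad(speech, sample_rate, chunk_size_ms=200):
    # Requires funasr; weights are fetched on first use.
    from funasr import AutoModel
    model = AutoModel(model="fsmn-vad")
    stride = vad_chunk_stride(chunk_size_ms, sample_rate)
    cache, segments = {}, []
    total = (len(speech) - 1) // stride + 1
    for i in range(total):
        chunk = speech[i * stride:(i + 1) * stride]
        res = model.generate(input=chunk, cache=cache,
                             is_final=(i == total - 1),
                             chunk_size=chunk_size_ms)
        if res[0]["value"]:  # segment boundaries [[beg_ms, end_ms], ...]
            segments.extend(res[0]["value"])
    return segments
```

The detected segment boundaries could then be used to decide when to cut an utterance and hand it to a heavier offline pass.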

step3: Changed the model in the example code to:

model = AutoModel(model="paraformer-zh",
                  vad_model="fsmn-vad",  
                  punc_model="ct-punc", 
                  spk_model="cam++",
                  )
...
model.generate(input=speech_chunk, cache=cache, is_final=is_final) # keep only these 3 parameters

Result: compared with step 2, all earlier chunks still return empty, but the final chunk (where is_final=True) returns the text for that chunk, e.g. [{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '模型', 'timestamp': ...}]

step4: Building on step 3, changed model.generate to model.generate(input=speech_chunk, cache=cache, is_final=True)

Result: every chunk is recognized, but since is_final=True is always set, this cannot meet the streaming requirements of stitching the conversation together and distinguishing speakers.
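
The step 3/4 behavior suggests the vad/punc/spk pipeline only produces output once it sees a final chunk. One workaround that stays within documented FunASR usage is a two-pass setup: the streaming model supplies live partial text, and the offline model (with vad_model, punc_model, spk_model, as in the README's offline example) is run over the buffered audio once the utterance ends. This is a sketch, not an official recipe; `build_models`, `transcribe`, and `num_chunks` are names I made up.

```python
# Hedged two-pass sketch: paraformer-zh-streaming for live partials,
# offline paraformer-zh (+ fsmn-vad/ct-punc/cam++) for the final
# punctuated, speaker-labelled transcript.

def num_chunks(n_samples, stride):
    # Number of chunks needed to cover n_samples samples.
    return (n_samples - 1) // stride + 1

def build_models():
    from funasr import AutoModel  # requires funasr; models download on first use
    online = AutoModel(model="paraformer-zh-streaming")
    offline = AutoModel(model="paraformer-zh", vad_model="fsmn-vad",
                        punc_model="ct-punc", spk_model="cam++")
    return online, offline

def transcribe(speech, online, offline, chunk_stride=9600):  # 600 ms at 16 kHz
    cache, partials = {}, []
    total = num_chunks(len(speech), chunk_stride)
    for i in range(total):
        chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
        res = online.generate(input=chunk, cache=cache,
                              is_final=(i == total - 1))
        partials.append(res[0]["text"])  # incremental text, no punctuation
    # Second pass over the whole buffered utterance: punctuation + speakers.
    final = offline.generate(input=speech, batch_size_s=300)
    return "".join(partials), final[0]
```

The obvious cost is latency on the second pass, but it matches the March 2024 community reply quoted below: speaker separation is currently an offline-only capability.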

What's your environment?

  • OS: Windows 11, don't use docker
  • PyTorch Version (e.g., 2.0.0): 2.5.1
  • How you installed funasr (pip, source): pip
  • Python version: 3.12.3

My Question:

Why does recognition fail when is_final=False? Could someone share a code example that loads vad_model, punc_model and spk_model and still works for streaming? Many thanks!
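
For the punctuation part alone, one option is to leave the streaming recognizer untouched and restore punctuation in a separate ct-punc pass over the accumulated text (ct-punc usage as shown in the FunASR README's punctuation-restoration example). The buffering helper below is my own invention, offered as a sketch:

```python
# Hedged sketch: punctuate accumulated streaming output with a separate
# ct-punc pass, instead of wiring punc_model into the streaming AutoModel.

def accumulate(partials):
    # Join incremental streaming outputs into one raw, unpunctuated string.
    return "".join(p for p in partials if p)

def punctuate(text):
    from funasr import AutoModel  # downloads the ct-punc model on first use
    punc = AutoModel(model="ct-punc")
    return punc.generate(input=text)[0]["text"]
```

This does not solve speaker separation, but it would give punctuated text at segment boundaries without forcing is_final=True on every chunk.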


Full original code attached:

from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600 ms, [0, 8, 4] = 480 ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming",
                  vad_model="fsmn-vad",  
                  punc_model="ct-punc", 
                  spk_model="cam++",
                  )

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms at 16 kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride) + 1  # fixed parenthesization vs. README
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final)
    print(res)

1113200320 avatar Nov 25 '24 09:11 1113200320

I have a similar need. If the OP finds a solution, please share it in the comments.

wqzh avatar Dec 02 '24 10:12 wqzh

+1

AliceShen122 avatar Dec 10 '24 01:12 AliceShen122

Streaming doesn't seem to produce punctuation.

chengligen avatar Dec 29 '24 14:12 chengligen

+1

zskyliang avatar Feb 24 '25 10:02 zskyliang

+1

Ryaningli avatar Mar 21 '25 03:03 Ryaningli

+1

shaunabanana avatar Mar 24 '25 13:03 shaunabanana

+1. Has anyone solved this?

Zongse avatar May 15 '25 07:05 Zongse

[Image] This is a reply from the Alibaba community from March 2024: the real-time speech recognition service currently cannot directly distinguish multiple speakers, but the offline version can. A year has passed since then; not sure whether there have been any updates.

Zongse avatar May 15 '25 07:05 Zongse

+1 on this issue.

QiushiStaff avatar Jun 04 '25 02:06 QiushiStaff

+1 on this issue.

biao-lvwan avatar Jul 03 '25 10:07 biao-lvwan

+1~

dlz620301 avatar Nov 25 '25 12:11 dlz620301