Unexpected Error Encountered During Inference with `paraformer-zh`

Open ludwigax opened this issue 2 months ago • 1 comments

❓ Questions and Help

What is your question?

I'm trying to deploy paraformer-zh as an ASR model in a Python environment, following the example provided in FunASR/examples. I ran the example code without any modifications, but the output is wrong: the model returns strange strings such as "galaxy" or "galaxy xy".

I've double-checked the input audio's properties (bit depth, sampling rate, and data type), and everything appears to be correct. At this point, I'm unsure what might be causing this issue.
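For anyone who wants to repeat the property check, here is a minimal sketch using only the standard-library `wave` module. `describe_wav` is a hypothetical helper name, not part of FunASR, and `wave` reads uncompressed PCM files only, so other WAV encodings may raise `wave.Error`.

```python
import wave


def describe_wav(path):
    """Return the basic PCM properties stored in a WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "num_frames": frames,
            "duration_s": frames / rate,
        }


# Example call (path is the locally downloaded sample file):
# print(describe_wav("./asr_example_zh.wav"))
```

This only reads the header, so it confirms what the file claims to contain, not how a particular loader decodes it.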

Code

from funasr import AutoModel

model = AutoModel(model="paraformer-zh", device="cuda", model_revision="v2.0.4")
result = model.generate(
    input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav',
)
print(result)

This produces the following output:

Notice: ffmpeg is not installed. torchaudio is used to load audio
If you want to use ffmpeg backend to load audio, please install it by:
        sudo apt install ffmpeg # ubuntu
        # brew install ffmpeg # mac
funasr version: 1.2.7.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.2.7
Downloading Model from https://www.modelscope.cn to directory: C:\Users\Ludwig\.cache\modelscope\hub\models\iic\speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
2025-11-14 04:16:36,063 - modelscope - INFO - Use user-specified model revision: v2.0.4
WARNING:root:trust_remote_code: False
rtf_avg: 0.078: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.29it/s]
[{'key': 'asr_example_zh', 'text': 'galaxy xy', 'timestamp': [[750, 2930], [3370, 4090]]}]

What's your environment?

OS (e.g., Linux): WINDOWS 10
FunASR Version (e.g., 1.0.0): 1.2.7
ModelScope Version (e.g., 1.11.0): 1.31.0
PyTorch Version (e.g., 2.0.0): 2.4.0+cu118
How you installed funasr (pip, source): from git clone with wheel build
Python version: 3.11.9
GPU (e.g., V100M32): RTX4070
CUDA/cuDNN version (e.g., cuda11.7): 12.8
Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1): no docker, no wsl :(
Any other relevant information:

Nov 13 '25 20:11 ludwigax

In a nutshell

Audio loading seems to behave inconsistently depending on whether ffmpeg is available, apparently because load_audio_text_image_video() relies on ffmpeg internally. When ffmpeg is not available (typical on Windows), the fallback audio decoding path produces an incorrect waveform, which leads to completely wrong ASR output (e.g., "galaxy xy" with mismatched timestamps). This may indicate a bug in the fallback audio loader inside load_utils.
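A quick way to confirm which situation a machine is in is to check whether an `ffmpeg` executable is visible to Python. This is a sketch under the assumption that the library finds ffmpeg by name on `PATH`; the issue does not confirm how FunASR actually detects it, and `ffmpeg_on_path` is a hypothetical helper name.

```python
import shutil


def ffmpeg_on_path():
    """Return the resolved ffmpeg executable path, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


location = ffmpeg_on_path()
if location is None:
    # Matches the situation behind the "ffmpeg is not installed" notice in the log.
    print("ffmpeg not found on PATH")
else:
    print(f"ffmpeg found at {location}")
```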

I used the following minimal reproducible script:

from funasr.utils.load_utils import load_audio_text_image_video

# Load the sample file through FunASR's own loader, without running the model.
data_in = ["./asr_example_zh.wav"]
frontend_fs = 16000
audio_sample_list = load_audio_text_image_video(
    data_in, fs=frontend_fs, audio_fs=16000
)
print(type(audio_sample_list))
# The list holds one waveform tensor per input file.
print(audio_sample_list[:10])
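As an independent reference for the values this script prints, the same file can be decoded without FunASR, torchaudio, or ffmpeg. The sketch below assumes the file is 16-bit PCM and scales samples by 2**15, which is one common convention and not necessarily what either FunASR loading path does. `read_pcm16_as_float` is a hypothetical helper name.

```python
import array
import sys
import wave


def read_pcm16_as_float(path):
    """Decode a 16-bit PCM WAV into floats in [-1.0, 1.0) plus its sample rate."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Keep only the first channel of the interleaved data, then scale by 2**15.
    return [s / 32768.0 for s in samples[::channels]], sample_rate


# Example call (path is the locally downloaded sample file):
# waveform, rate = read_pcm16_as_float("./asr_example_zh.wav")
# print(rate, waveform[:3], waveform[-3:])
```

Comparing the first and last few values against the tensors printed on each platform shows which loading path, if any, agrees with a plain PCM decode.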

The exact same Python environment was created twice:

Windows 10 (no ffmpeg installed)
Ubuntu 22.04 inside Docker (ffmpeg installed)

The only difference is the presence of ffmpeg.

On Windows

[{'key': 'asr_example_zh', 'text': 'galaxy xy', 'timestamp': [[750, 2930], [3370, 4090]]}]
<class 'list'>
[tensor([ 0.0022,  0.0022,  0.0022,  ..., -0.0037, -0.0022, -0.0015])]

On Linux inside Docker

[{'key': 'asr_example_zh', 'text': '欢 迎 大 家 来 体 验 达 摩 院 推 出 的 语 音 识 别 模 型', 'timestamp': [[870, 1110], [1110, 1350], [1370, 1530], [1530, 1770], [1770, 2010], [2010, 2170], [2170, 2410], [2490, 2650], [2650, 2830], [2830, 3030], [3030, 3230], [3230, 3470], [3470, 3710], [3710, 3950], [3950, 4190], [4210, 4410], [4410, 4610], [4610, 4830], [4830, 5245]]}]
<class 'list'>
[tensor([ 9.1553e-05,  9.1553e-05,  9.1553e-05,  ..., -1.5259e-04,
        -9.1553e-05, -6.1035e-05])]
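To put a number on the difference, the sketch below divides the six sample values visible in the Windows printout by the matching values from the Linux printout. It uses only the numbers shown above; the Windows values are rounded to four decimals, so the ratios are approximate.

```python
# Sample values copied from the two printed tensors above
# (first three and last three elements of each waveform).
windows = [0.0022, 0.0022, 0.0022, -0.0037, -0.0022, -0.0015]
linux = [9.1553e-05, 9.1553e-05, 9.1553e-05, -1.5259e-04, -9.1553e-05, -6.1035e-05]

# Element-wise ratio between the two loading paths.
ratios = [w / l for w, l in zip(windows, linux)]

for w, l, r in zip(windows, linux, ratios):
    print(f"windows={w:+.4f}  linux={l:+.4e}  ratio={r:.2f}")

# Linux values expressed in 16-bit PCM steps (1 / 32768).
print([round(l * 32768, 3) for l in linux])
```

The ratios all land close to 24, and the Linux values are close to whole multiples of 1/32768. From six samples this looks more like an amplitude scaling difference between the two paths than different audio content, but that is only an observation from the printed values, not a confirmed diagnosis.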

Nov 13 '25 21:11 ludwigax