CosyVoice: The speech generated by cosyvoice.inference_zero_shot does not follow the word order of the input text. The generated audio file is basically wrong.

The speech generated by cosyvoice.inference_zero_shot does not follow the word order of the input text. The generated audio file is basically wrong. I wrapped the cosyvoice.inference_zero_shot interface with FastAPI, but in the audio file the model generates, words are out of order and some are missing. Code:

    async def inference_zero_shot(
        tts_text: str = Form(...),
        instruct_text: str = Form(...),
        prompt_audio: UploadFile = File(...)
    ):
        try:
            file_content = await prompt_audio.read()
            check_audio_duration(file_content)
            with open('temp_prompt.wav', 'wb') as f:
                f.write(file_content)
            prompt_speech_16k = load_wav('temp_prompt.wav', 16000)

            for i, j in enumerate(cosyvoice.inference_zero_shot(tts_text, instruct_text, prompt_speech_16k, stream=False)):
                output_path = 'zero_shot_{}.wav'.format(i)
                torchaudio.save(output_path, j['tts_speech'], 24000)
                return FileResponse(output_path)
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))

May 19 '25 07:05 cckamiya

Change 24000 to cosyvoice.sample_rate, and add format="wav" in the torchaudio.save call.
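To illustrate why the hard-coded rate matters: the sample rate written into a WAV header decides how fast the samples are played back, so if the model produces audio at a different rate than 24000, the saved file plays at the wrong speed and pitch. The sketch below uses only the Python standard library, not CosyVoice or torchaudio. MODEL_RATE = 22050 is a hypothetical stand-in for cosyvoice.sample_rate, and write_wav and playback_seconds are illustrative helpers, not part of any library. If the loaded model really does output 24000 Hz audio, the two values are the same and nothing changes.

```python
import io
import math
import struct
import wave


def write_wav(samples, header_rate):
    """Encode float samples in [-1, 1] as 16-bit mono WAV bytes (illustrative helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)            # 16-bit PCM
        w.setframerate(header_rate)  # the rate a player will trust
        w.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))
    return buf.getvalue()


def playback_seconds(wav_bytes):
    """Duration a player derives from the header: frames / declared rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as r:
        return r.getnframes() / r.getframerate()


MODEL_RATE = 22050  # assumption: hypothetical stand-in for cosyvoice.sample_rate

# Two seconds of a 440 Hz tone, generated at the model's rate.
samples = [
    math.sin(2 * math.pi * 440 * n / MODEL_RATE)
    for n in range(2 * MODEL_RATE)
]

correct = playback_seconds(write_wav(samples, MODEL_RATE))
mismatched = playback_seconds(write_wav(samples, 24000))

print(correct)     # 2.0
print(mismatched)  # 1.8375
```

With the mismatched header, the same two seconds of audio play back in about 1.84 seconds, so the speech sounds sped up and higher pitched. Passing the model's own rate when saving avoids this regardless of which model is loaded.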

May 23 '25 03:05 FAFUuser

Update the code and model, then follow the README instructions first to see whether it works.

May 26 '25 03:05 aluminumbox

@cckamiya I encountered a similar issue. When I trained a model initialized with Qwen-2.5 and the training did not converge well, the word order became misaligned and some words were skipped. Once the training proceeded successfully, this failure case was greatly reduced.

Jun 10 '25 13:06 hayeong0