CosyVoice

The speech file generated by cosyvoice.inference_zero_shot does not match the order of the words. The generated audio file is basically wrong.

Open cckamiya opened this issue 9 months ago • 2 comments

The speech generated by cosyvoice.inference_zero_shot does not follow the word order of the input text; the generated audio is essentially wrong. I wrapped the cosyvoice.inference_zero_shot interface in a FastAPI endpoint, but the audio the model generates has words in the wrong order and some words missing.

Code:

    async def inference_zero_shot(
        tts_text: str = Form(...),
        instruct_text: str = Form(...),
        prompt_audio: UploadFile = File(...)
    ):
        try:
            file_content = await prompt_audio.read()
            check_audio_duration(file_content)
            with open('temp_prompt.wav', 'wb') as f:
                f.write(file_content)
            prompt_speech_16k = load_wav('temp_prompt.wav', 16000)

            for i, j in enumerate(cosyvoice.inference_zero_shot(tts_text, instruct_text, prompt_speech_16k, stream=False)):
                output_path = 'zero_shot_{}.wav'.format(i)
                torchaudio.save(output_path, j['tts_speech'], 24000)
                return FileResponse(output_path)
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))
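
One thing worth checking in the snippet above: the `return` sits inside the `for` loop, so the handler returns after the first item the generator yields. Whether `cosyvoice.inference_zero_shot` yields more than one item for a given input is an assumption here, not something confirmed in this thread; if it does, any later segments would never be sent, which could look like missing words. The pure-Python sketch below only illustrates that control flow. `fake_inference` is a hypothetical stand-in, not the CosyVoice API.

```python
def fake_inference(text):
    # Hypothetical stand-in: yields one chunk per sentence, the way a
    # segmenting generator might. This is NOT the real CosyVoice call.
    for sentence in text.split(". "):
        yield {"tts_speech": sentence}


def first_only(text):
    # Mirrors the posted handler: returning inside the loop exits
    # on the first iteration, so only the first chunk is used.
    for i, chunk in enumerate(fake_inference(text)):
        return chunk["tts_speech"]


def all_chunks(text):
    # Collects every chunk before returning anything.
    return [chunk["tts_speech"] for chunk in fake_inference(text)]
```

If the real generator does yield several segments, collecting them all (and joining the audio) before building the response would avoid dropping the tail of the text.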

cckamiya avatar May 19 '25 07:05 cckamiya

In the torchaudio.save call, change 24000 to cosyvoice.sample_rate and add format="wav".

FAFUuser avatar May 23 '25 03:05 FAFUuser

Update the code and the model, then follow the README instructions first to see whether it works.

aluminumbox avatar May 26 '25 03:05 aluminumbox

@cckamiya I encountered a similar issue. When I trained a model initialized with Qwen-2.5 and the training did not converge well, the word order became misaligned and some words were skipped. Once the training proceeded successfully, this failure case was greatly reduced.

hayeong0 avatar Jun 10 '25 13:06 hayeong0