The speech file generated by cosyvoice.inference_zero_shot does not match the order of the words. The generated audio file is basically wrong.
I used FastAPI to wrap the cosyvoice.inference_zero_shot interface, but the words in the audio file generated by the model are out of order and some words are missing.

Code:

    async def inference_zero_shot(
        tts_text: str = Form(...),
        instruct_text: str = Form(...),
        prompt_audio: UploadFile = File(...)
    ):
        try:
            file_content = await prompt_audio.read()
            check_audio_duration(file_content)
            with open('temp_prompt.wav', 'wb') as f:
                f.write(file_content)
            prompt_speech_16k = load_wav('temp_prompt.wav', 16000)
            for i, j in enumerate(cosyvoice.inference_zero_shot(tts_text, instruct_text, prompt_speech_16k, stream=False)):
                output_path = 'zero_shot_{}.wav'.format(i)
                torchaudio.save(output_path, j['tts_speech'], 24000)
            return FileResponse(output_path)
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))
Change 24000 to cosyvoice.sample_rate and add format="wav" to the torchaudio.save call.
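One more thing that might explain the missing words, offered as an assumption rather than a confirmed diagnosis: cosyvoice.inference_zero_shot is a generator and can yield more than one segment for longer text, while the handler in the question returns only a single saved file. Below is a minimal, stdlib-only sketch of joining every segment in order and writing one WAV at a sample rate passed in by the caller instead of a hardcoded 24000. `fake_inference`, `join_segments`, and `to_wav_bytes` are hypothetical names used for illustration; the stand-in generator yields plain lists of floats, not real model output.

```python
import io
import struct
import wave


def fake_inference(text):
    """Hypothetical stand-in for cosyvoice.inference_zero_shot.

    Yields one dict per sentence, mimicking a generator that returns
    the audio in several segments.
    """
    sentences = [s for s in text.split('.') if s.strip()]
    for k, _sentence in enumerate(sentences):
        # Four dummy samples per segment; a real model yields tensors.
        yield {'tts_speech': [0.1 * (k + 1)] * 4}


def join_segments(segments):
    """Concatenate every yielded segment in order, dropping none."""
    samples = []
    for seg in segments:
        samples.extend(seg['tts_speech'])
    return samples


def to_wav_bytes(samples, sample_rate):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)  # taken from the caller, not hardcoded
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        w.writeframes(struct.pack('<%dh' % len(pcm), *pcm))
    return buf.getvalue()


audio = join_segments(fake_inference('One. Two. Three.'))
wav_bytes = to_wav_bytes(audio, sample_rate=24000)
```

With real model output the segments are tensors, so the join would presumably be torch.cat over the collected j['tts_speech'] values along the time dimension, followed by a single torchaudio.save(output_path, joined, cosyvoice.sample_rate, format="wav").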
Update the code and the model, then follow the README instructions first to see whether it works.
@cckamiya I encountered a similar issue. When I trained a model initialized with Qwen-2.5 and the training did not converge well, the word order became misaligned and some words were skipped. Once the training proceeded successfully, this failure case was greatly reduced.