CosyVoice: Some questions about flow

This is an amazing open-source project! Why is the flow decoder used in CosyVoice much larger than the Matcha-TTS decoder? What is the purpose of enlarging the decoder? Is it to improve sound quality or zero-shot performance?

Aug 30 '24 12:08 howitry

We use a speech tokenizer, which means we must use a flow model to reconstruct the mel sequence.
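For readers unfamiliar with the idea, here is a minimal sketch of how a flow model can turn noise into a mel frame by integrating a velocity field from t = 0 to t = 1 with Euler steps. It is not CosyVoice's code. The names `euler_flow_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field is a closed-form stand-in for what would, in a real system, be a learned decoder network conditioned on the speech tokens.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    n_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned, token-conditioned estimator.

    For a straight-line path from noise to `target`, the velocity that
    carries x to the target by t=1 is (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(m - xi) / (1.0 - t) for m, xi in zip(target, x)]

    return velocity


# Assumed toy values: a 3-bin "mel frame" and a fixed noise sample.
toy_mel_frame = [0.5, -1.0, 2.0]
noise = [0.1, 0.3, -0.7]
result = euler_flow_sample(make_toy_velocity(toy_mel_frame), noise, n_steps=10)
```

In this sketch the sampler itself has no parameters; all model capacity sits in the velocity estimator, which is the component whose size the question is about.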

Sep 03 '24 05:09 aluminumbox

> We use a speech tokenizer, which means we must use a flow model to reconstruct the mel sequence.

I understand that the flow model is used to transform speech tokens (codes) into mel. But the flow decoder used in Matcha-TTS (https://github.com/shivammehta25/Matcha-TTS/blob/main/configs/model/decoder/default.yaml) is much smaller than the one in CosyVoice. I would like to know the reason for enlarging the flow decoder.

Sep 03 '24 06:09 howitry

I have the same question.

Sep 05 '24 12:09 huskyachao

This issue is stale because it has been open for 30 days with no activity.

Oct 06 '24 02:10 github-actions[bot]