Some questions about flow
This is an amazing open source project! Why is the flow decoder used in CosyVoice much larger than the Matcha-TTS decoder? What is the purpose of enlarging the decoder? Is it to improve sound quality or zero-shot performance?
We use a speech tokenizer, which means we must use a flow model to reconstruct the mel sequence from the speech tokens.
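For readers unfamiliar with the idea, here is a toy sketch of what "reconstructing the mel sequence with a flow model" can look like, assuming a conditional flow matching formulation of the kind used in Matcha-TTS. It is not CosyVoice's code: `TOY_TOKEN_TO_MEL`, `toy_estimator`, and `reconstruct_mel` are hypothetical names, and the lookup table stands in for the learned decoder network whose size this issue asks about.

```python
import random

# Hypothetical lookup standing in for the learned conditioning: each discrete
# speech token maps to a tiny 3-bin "mel frame". A real decoder learns this.
TOY_TOKEN_TO_MEL = {0: [0.0, 0.5, 1.0], 1: [1.0, 0.0, -1.0], 2: [-0.5, 0.5, 0.0]}


def toy_estimator(x, t, tokens):
    """Stand-in for the decoder network: returns a velocity for every frame.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the conditional
    velocity toward the target x1 is (x1 - x_t) / (1 - t).
    """
    velocity = []
    for frame, token in zip(x, tokens):
        target = TOY_TOKEN_TO_MEL[token]
        velocity.append([(m - v) / (1.0 - t) for m, v in zip(target, frame)])
    return velocity


def reconstruct_mel(tokens, n_steps=10, seed=0):
    """Integrate dx/dt = v(x, t | tokens) from t=0 (noise) toward t=1 with Euler steps."""
    rng = random.Random(seed)
    n_bins = len(TOY_TOKEN_TO_MEL[0])
    # Start from Gaussian noise, one frame per speech token.
    x = [[rng.gauss(0.0, 1.0) for _ in range(n_bins)] for _ in tokens]
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = toy_estimator(x, t, tokens)
        x = [[xi + dt * vi for xi, vi in zip(fx, fv)] for fx, fv in zip(x, v)]
    return x


mel = reconstruct_mel([0, 2, 1])
```

In a real system the velocity comes from a neural network conditioned on the tokens (and typically on speaker information), so the capacity of that network is what determines the decoder size.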
I understand that the flow model is used to transform codes into mel. But the flow decoder used in Matcha-TTS (https://github.com/shivammehta25/Matcha-TTS/blob/main/configs/model/decoder/default.yaml) is much smaller than the one in CosyVoice. I want to know the reason for enlarging the flow decoder.
I have the same confusion.
This issue is stale because it has been open for 30 days with no activity.