huangx06
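I also doubt that character-level modeling can reach high accuracy. There are too many homophones; for the model to pick the right character, wouldn't it need very strong context-modeling ability?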
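To make the homophone concern concrete, here is a minimal sketch. The `HOMOPHONES` table is a small hand-picked illustration (not a real lexicon), and the helper names `candidates` and `ambiguity` are hypothetical; it only shows how quickly the number of character sequences grows when each syllable maps to several characters.

```python
# Hand-picked illustration: common characters that share one toned syllable.
# This table is an assumption for the example, not a complete lexicon.
HOMOPHONES = {
    "shi4": ["是", "事", "市", "式", "试", "世", "室"],
    "yi4": ["意", "义", "易", "议", "亿", "艺"],
}


def candidates(syllable):
    """Return the characters in the table that share this toned syllable."""
    return HOMOPHONES.get(syllable, [])


def ambiguity(syllables):
    """Count the character sequences consistent with a syllable sequence."""
    total = 1
    for syllable in syllables:
        # Syllables missing from the table are treated as unambiguous.
        total *= max(len(candidates(syllable)), 1)
    return total


if __name__ == "__main__":
    print(ambiguity(["shi4", "yi4"]))
```

Without context, every one of those sequences is equally plausible from the pronunciation alone, which is why the model has to rely on surrounding characters to choose.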
I think you can use ASR to convert your audio to text.
I don't know exactly what your problem is. The training data for a Tacotron model consists of symbol-audio pairs. You said you have audio without labeled text, so I suggest that...
Yes, ASR stands for Automatic Speech Recognition, but I don't think gluing an ASR model and a TTS model together would be convenient.
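For reference, the ASR suggestion discussed above could be sketched roughly as follows. This is only a sketch under assumptions: `transcribe` stands in for whatever ASR system is available (no specific library is implied), the helper names `build_pairs` and `to_metadata` are hypothetical, and the pipe-separated `id|text` layout is chosen for illustration rather than taken from any particular Tacotron implementation.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def build_pairs(
    wav_paths: Iterable[str],
    transcribe: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each audio file with the text returned by an ASR callable.

    `transcribe` is a placeholder for any ASR system (an assumption here).
    Files with an empty transcript are skipped.
    """
    pairs = []
    for wav_path in wav_paths:
        text = transcribe(wav_path).strip()
        if text:
            # Use the file name without its extension as the utterance id.
            pairs.append((Path(wav_path).stem, text))
    return pairs


def to_metadata(pairs: List[Tuple[str, str]]) -> str:
    """Render the pairs as 'id|text' lines, one utterance per line."""
    return "\n".join(f"{utt_id}|{text}" for utt_id, text in pairs)
```

Transcripts produced this way contain recognition errors, so they would likely need filtering or manual checking before being used as TTS training labels.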