poor performance on short phrases
I trained the multi-speaker model on VCTK (~400k), and for longer input phrases (i.e., >5 words), performance is approximately comparable to the released pretrained model.
For shorter phrases (i.e., 1-2 words), pronunciation degrades significantly. Words that are pronounced correctly as part of a longer phrase become hard to understand when passed as the only word in the input.
Is anyone else experiencing this? I'd love some intuition about what's causing this and how to correct it.
I also experienced the pronunciation problem. My case was worse, since pronunciation degraded significantly even for long inputs. Have you solved this?
I also encountered this problem.
@wizardk What GPUs are you training on? Did you have to change the batch size/learning rate to adapt to your hardware setup? Can you upload some example wav files to Google Drive?
There were several issues with my original setup, and the solution to your specific issue depends on the answers to the above.
- I used 8 GPUs to train on private multi-speaker & multi-language data.
- Based on vctk_base.json, I changed batch_size from 64 to 16.
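For reference, the change described above would look roughly like this, assuming a VITS-style `vctk_base.json` with the batch size under a `train` section (the exact key layout may differ in your version of the config):

```diff
   "train": {
     ...
-    "batch_size": 64,
+    "batch_size": 16,
     ...
   }
```

Note that if the per-GPU batch size is reduced without scaling the learning rate or gradient accumulation to match, the effective training dynamics change, which could itself contribute to degraded pronunciation.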
Hi guys, any updates?