
poor performance on short phrases

Open negidius opened this issue 4 years ago • 8 comments

I trained the multi-speaker model on VCTK (~400k), and for longer input phrases (i.e. >5 words) performance is approximately comparable to the released pretrained model.

For shorter phrases (i.e. 1-2 words), pronunciation degrades significantly. Words that are pronounced correctly as part of a longer phrase become hard to understand when passed as the only word in the input.

Is anyone else experiencing this? Would love some intuition behind what's causing this and how to correct this issue.
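
For reference, I'm generating samples more or less like the repo's inference.ipynb; the checkpoint path and speaker id in the sketch below are placeholders for my setup, and the two test phrases are just examples:

```python
# Rough repro sketch, adapted from the repo's inference.ipynb.
# Checkpoint path, config path and speaker id are placeholders for my setup.
import torch
from scipy.io.wavfile import write

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols


def get_text(text, hps):
    # Same text front-end as inference.ipynb: clean, then optionally intersperse blanks.
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)


hps = utils.get_hparams_from_file("./configs/vctk_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("logs/vctk_base/G_400000.pth", net_g, None)  # placeholder path

# Same word alone vs. embedded in a sentence; only the short one degrades for me.
for name, phrase in [("short", "Sunshine."),
                     ("long", "The sunshine was warm on the hillside this morning.")]:
    stn_tst = get_text(phrase, hps)
    with torch.no_grad():
        x_tst = stn_tst.cuda().unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
        sid = torch.LongTensor([4]).cuda()  # placeholder speaker id
        audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667,
                            noise_scale_w=0.8, length_scale=1)[0][0, 0].data.cpu().float().numpy()
    write(f"{name}.wav", hps.data.sampling_rate, audio)
```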

negidius avatar Oct 02 '21 19:10 negidius

I also experienced the pronunciation problem. My case was worse, since pronunciation degrades significantly even for long inputs. Have you solved this?

wade3han avatar Nov 09 '21 01:11 wade3han


I also encountered this problem.

wizardk avatar Dec 16 '21 09:12 wizardk

@wizardk what GPUs are you training on? Did you have to change the batch size/learning rate to adapt to your hardware setup? Can you upload some example wav files to Google Drive?

There were several issues with my original setup, and the solution to your specific issue depends on the answers to the above.

negidius avatar Dec 17 '21 00:12 negidius


  1. I used 8 GPUs to train on private multi-speaker & multi-language data.
  2. Based on vctk_base.json, I changed batch_size from 64 to 16.
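
For what it's worth, here is the rough arithmetic on effective batch size, assuming batch_size in the config is per process (which is how train_ms.py seems to treat it); the 4-GPU reference below is only an assumption for illustration, not the authors' documented setup:

```python
# Back-of-envelope on effective batch size. In train_ms.py the
# DistributedBucketSampler is built with hps.train.batch_size per process,
# so the global batch scales with the number of GPUs.
# The 4-GPU "reference" is only an assumption for illustration.

def effective_batch(per_gpu_batch_size: int, n_gpus: int) -> int:
    return per_gpu_batch_size * n_gpus

reference = effective_batch(64, 4)   # vctk_base.json default batch_size, assumed 4 GPUs
mine = effective_batch(16, 8)        # my run: batch_size 16 on 8 GPUs

print(f"reference effective batch: {reference}")  # 256
print(f"my effective batch:        {mine}")       # 128
# With a smaller effective batch and the config's default learning_rate,
# training dynamics can differ noticeably; adjusting the LR or training
# longer might be needed to match the released checkpoints.
```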

wizardk avatar Dec 20 '21 03:12 wizardk

Hi guys, any updates?

martin3252 avatar Apr 22 '22 02:04 martin3252