
Results get worse when I use ground-truth duration.

Open · AlexanderXuan opened this issue 4 years ago · 53 comments

Dear author, thank you for your contribution to TTS; this is a big step for E2E TTS. But when I use the ground-truth duration, aiming to train faster and get more accurate durations, the duration loss drops fast while the KL loss drops slowly. I only changed the attn matrix to use the true duration. I checked the components of the loss but could not find the alignment-related part. Could you please give me some help with this problem?
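[Editor's note: for readers wondering what "changed the attn matrix to use the true duration" could look like, here is a minimal sketch. The helper name and shapes are hypothetical, not from the VITS repo; it just expands integer frame durations into the hard 0/1 monotonic alignment matrix that MAS would otherwise produce.]

```python
import torch

def duration_to_attn(durations: torch.Tensor) -> torch.Tensor:
    """Expand integer ground-truth durations (one per input token) into a
    hard 0/1 alignment matrix of shape (t_text, t_feat) -- the same shape
    the MAS search yields in VITS. Hypothetical helper for illustration."""
    t_text = durations.numel()
    t_feat = int(durations.sum().item())
    attn = torch.zeros(t_text, t_feat)
    pos = 0
    for i, d in enumerate(durations.tolist()):
        attn[i, pos:pos + d] = 1.0  # token i covers d consecutive frames
        pos += d
    return attn
```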

AlexanderXuan · Jun 21 '21 10:06

@AlexanderXuan With the original duration training method, do you get good synthesized results? I tried to train the model on my own Chinese dataset, but the training seems abnormal, and the wavs synthesized with 180k.pth are bad.

Liujingxiu23 · Jun 28 '21 02:06

@Liujingxiu23 Sorry for my late reply. My training progressed without problems, but in my original training results the Chinese voice has some pitch problems; in my opinion they are caused by the VAE part.

AlexanderXuan · Jul 06 '21 14:07

@AlexanderXuan Thank you for your reply. I made some mistakes in my training; when I fixed them and trained on my Chinese dataset again, the synthesized wavs were excellent, without any pitch problem. Could the pitch problem be related to the speaker?

Liujingxiu23 · Jul 07 '21 03:07

@Liujingxiu23 I use the multi-speaker version with 6 speakers; the pitch has a slight problem with 64k.pth. Maybe my training time isn't enough, or my config has some problem too. Can you give me an email address? I want to discuss some other problems with you.

AlexanderXuan · Jul 07 '21 03:07

@AlexanderXuan I also use train_ms.py to train a multi-speaker model, with 8 female speakers at a sample rate of 16000 and all other configs at their defaults. I checked outputs of two speakers from checkpoint 65000.pth; the synthesized wavs are good, without any pitch problem.

But in the paper on Glow-WaveGAN, which is similar to VITS, the authors do add a pitch predictor: https://arxiv.org/abs/2106.10831?context=cs

Liujingxiu23 · Jul 07 '21 07:07

@Liujingxiu23 Can you share some samples with me? My email address is [email protected]. Maybe I should train my model again.

AlexanderXuan · Jul 07 '21 09:07

@AlexanderXuan Sorry, I cannot. I work at a commercial company, not a research center; our data is private.

Liujingxiu23 · Jul 07 '21 09:07

@Liujingxiu23 OK, thank you.

AlexanderXuan · Jul 07 '21 09:07

> @AlexanderXuan Thank you for your reply. I made some mistakes in my training; when I fixed them and trained on my Chinese dataset again, the synthesized wavs were excellent, without any pitch problem. Could the pitch problem be related to the speaker?

Hi @Liujingxiu23, what was wrong in your training?

leminhnguyen · Jul 07 '21 10:07

@leminhnguyen I made mistakes in processing my Chinese text, i.e. the input symbols.

Liujingxiu23 · Jul 08 '21 02:07

@Liujingxiu23 Is your config the same as the default? How long did training take to reach 300K steps?

leminhnguyen · Jul 15 '21 12:07

@leminhnguyen With the default model settings, except sample_rate=16000: 5 days to reach 300k steps with 2 GPUs (V100).

Liujingxiu23 · Jul 16 '21 10:07

@Liujingxiu23 For me, with a sample rate of 22050, it took about 8 days to reach 180K steps.

leminhnguyen · Jul 16 '21 11:07

@leminhnguyen Training is not very fast, but it is convenient since it is end-to-end, compared with two-stage training. The training of the original HiFi-GAN is also time-consuming.

Liujingxiu23 · Jul 19 '21 03:07

@leminhnguyen Hello bro, what dataset did you use for training? Vietnamese or English?

ductho9799 · Aug 09 '21 14:08

@ductho9799 Hey bro, I've trained for Vietnamese.

leminhnguyen · Aug 09 '21 16:08

@leminhnguyen Which Vietnamese dataset did you train on? Can you share it with me?

ductho9799 · Aug 09 '21 16:08

@ductho9799 Sorry, data is private so I cannot share it with you.

leminhnguyen · Aug 09 '21 16:08

@leminhnguyen Thank you so much! What company are you working at?

ductho9799 · Aug 09 '21 16:08

@ductho9799 Hey bro, I think this is not the place for chatting, so please send me an email at [email protected]; hope to hear from you soon!

leminhnguyen · Aug 09 '21 16:08

@leminhnguyen Hello, I'm interested in your experiments on the Vietnamese task. Have you ever compared the quality of audio synthesized by VITS with FastSpeech2? If so, which one do you think is more natural? I'd be grateful if you shared your experience!

icyda17 · Jan 04 '22 09:01

@icyda17 Hi, in my experiments VITS is better than FastSpeech2 in prosody and quality, but in some cases VITS suffers from mispronunciation.

leminhnguyen · Jan 04 '22 10:01

Hi @AlexanderXuan, I'm trying to use the ground-truth duration, but the added blank tokens puzzle me. Should the blanks be assigned any duration, or kept at zero?

OnceJune · Feb 09 '22 07:02

@OnceJune In my opinion, the blank is used to mitigate the duration problem, so if we use the ground-truth duration we don't need the blank. But I don't know whether this causes other problems, because my own result has some issues. If you really need the blank, maybe you can set the blank duration to zero.
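[Editor's note: a sketch of the "blank with zero duration" bookkeeping discussed above. `intersperse_with_blank` is a hypothetical helper written in the spirit of `commons.intersperse` from the VITS repo; the point is only that blanks carry zero frames, so the durations still sum to the total frame count.]

```python
def intersperse_with_blank(phoneme_ids, durations, blank_id=0):
    """Insert a blank token between (and around) phonemes, as VITS does
    for its text inputs, and give every blank a duration of zero so the
    ground-truth durations still sum to the number of frames."""
    out_ids, out_durs = [blank_id], [0]
    for pid, dur in zip(phoneme_ids, durations):
        out_ids.extend([pid, blank_id])   # phoneme, then trailing blank
        out_durs.extend([dur, 0])         # blank consumes no frames
    return out_ids, out_durs
```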

AlexanderXuan · Feb 09 '22 08:02

@AlexanderXuan Thank you, I will use zero for the blanks. What's the problem with your result? Pitch or mispronunciation?

OnceJune · Feb 09 '22 08:02

@leminhnguyen Thanks. Does mispronunciation in your case mean bad duration or tone issues? Btw, can I ask you more questions privately, by email or another chat platform?

icyda17 · Feb 10 '22 10:02

> @leminhnguyen Thanks. Does mispronunciation in your case mean bad duration or tone issues? Btw, can I ask you more questions privately, by email or another chat platform?

You can contact me via [email protected] 😃

leminhnguyen · Feb 10 '22 12:02

> Dear author, thank you for your contribution to TTS; this is a big step for E2E TTS. But when I use the ground-truth duration, aiming to train faster and get more accurate durations, the duration loss drops fast while the KL loss drops slowly. I only changed the attn matrix to use the true duration. I checked the components of the loss but could not find the alignment-related part. Could you please give me some help with this problem?

Have you gotten good results using the ground-truth duration?

hdmjdp · Feb 16 '22 02:02

@AlexanderXuan I have been trying to use the ground-truth duration these days, but training fails (at 10k steps everything works well, but at about 15k the loss and grad turn to NaN; I use the deterministic duration predictor, not the stochastic one). How did you compute loss_dur? Is it the same as in the original code, i.e. `l_length = torch.sum((log(duration_true) - log(duration_pred))**2, [1,2]) / torch.sum(x_mask)`?
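[Editor's note: for reference, a self-contained version of that loss with a clamp before the log. The clamp is a guard assumed by the editor, not from the VITS repo: zero-duration entries (e.g. zero-duration blanks) make `log(duration_true)` hit -inf, which is one plausible source of the NaNs reported above.]

```python
import torch

def duration_loss(d_true, d_pred_log, x_mask, eps=1e-6):
    # Log-domain MSE over valid positions, matching the snippet quoted
    # in the thread; d_pred_log is the predictor's log-duration output.
    # clamp(min=eps) keeps log() finite when a ground-truth duration is
    # zero (assumed NaN source), at the cost of a large but finite term.
    diff = (torch.log(d_true.clamp(min=eps)) - d_pred_log) * x_mask
    return torch.sum(diff ** 2, [1, 2]) / torch.sum(x_mask)
```

With perfect predictions the loss is exactly zero, and a zero ground-truth duration now yields a finite value instead of NaN.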

Liujingxiu23 · Feb 23 '22 08:02

Has anyone succeeded in training with ground-truth durations, obtaining precise time boundaries of phones and good waveforms at inference?

Liujingxiu23 · Mar 04 '22 09:03