Stochastic duration predictor fails at inference with FastSpeech 2
I applied the stochastic duration predictor (SDP) to the FastSpeech 2 model. During training, the duration loss falls smoothly (from 1.2 to 0.2).

At inference, however, the duration predictor does not work at all (noise_scale=0.333).

Does anyone know the cause of this problem? The pseudocode I used is below:
```python
# In the variance adaptor. Tensors follow the VITS layout [B, C, T],
# so inputs.shape[-1] is the text length.
inputs = text_encoder_output + extended_speaker_embedding
sdp_mask = torch.unsqueeze(
    sequence_mask(text_lens, inputs.shape[-1]), 1
).to(inputs.dtype)

if training:
    # Forward pass: the SDP returns the negative log-likelihood of the
    # ground-truth log-durations taken from the hard alignment.
    duration_prediction = self.duration_predictor(
        inputs, sdp_mask, torch.log(attn_hard_dur.float() + 1).unsqueeze(1)
    )
    duration_prediction = duration_prediction / torch.sum(sdp_mask)
else:
    # Reverse pass: sample log-durations from noise.
    duration_prediction = self.duration_predictor(
        inputs, sdp_mask, reverse=True, noise_scale=0.333
    )
    duration_prediction = duration_prediction.squeeze(1)
    duration_rounded = torch.clamp(
        torch.round(torch.exp(duration_prediction) - 1) * d_control,
        min=1,
    )

# Loss (training only).
duration_loss = torch.sum(duration_prediction.float())
```
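For reference, one debugging sketch I'm considering (my own assumption, not from the VITS recipe): since noise_scale only scales the latent sampled in the reverse pass, running inference with noise_scale=0.0 makes the reverse flow deterministic, which should separate a broken flow from over-large noise.

```python
# Hypothetical sanity check: with zero noise the reverse flow is deterministic.
with torch.no_grad():
    logw = self.duration_predictor(inputs, sdp_mask, reverse=True, noise_scale=0.0)
    print(torch.exp(logw.squeeze(1))[0])  # inspect the predicted durations for one utterance
```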
How's the synthesis result with the FS2 duration predictor after the same number of training steps? Also, in FS2 training the gradient from the duration predictor is passed to the encoder, while VITS uses x.detach() to cut off that gradient; I think this should also be taken into consideration. https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L51
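A minimal sketch of that change at your call site (in VITS the detach actually happens inside StochasticDurationPredictor.forward, so this is only an illustration):

```python
# Sketch: stop the SDP gradient from flowing back into the text encoder,
# mirroring x = torch.detach(x) in VITS's StochasticDurationPredictor.
sdp_inputs = inputs.detach()  # the encoder is then trained only by the other losses
duration_prediction = self.duration_predictor(
    sdp_inputs, sdp_mask, torch.log(attn_hard_dur.float() + 1).unsqueeze(1)
)
```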
@OnceJune Thanks for your reply. The FastSpeech 2 duration predictor works well; an audio sample synthesized with the deterministic duration predictor (DDP) is below. https://user-images.githubusercontent.com/44384060/154802366-3e1a959f-8652-4adb-95f8-f234ceb09d87.mp4
I think that's a very good point. However, as mentioned in the paper, I am afraid the noisy loss from the SDP could adversely affect the text encoder (e.g., cause mispronunciation). I'll test this out and report back if the result is good.
@LEECHOONGHO Hi mate, have you successfully applied the SDP to FS2?