Stochastic duration prediction failed for fastspeech2

Open LEECHOONGHO opened this issue 3 years ago • 4 comments

I applied the stochastic duration predictor to the fastspeech2 model.

The duration loss falls smoothly during training (from 1.2 to 0.2). [image]

But at inference, the duration predictor does not work at all (noise_scale=0.333). [image]

Does anyone know the cause of this problem? The pseudocode I used is below:

# in variance adaptor
inputs = text_encoder_output + extended_speaker_embedding
sdp_mask = torch.unsqueeze(sequence_mask(text_lens, inputs.shape[-1]), 1).to(inputs.dtype)

if training:
    duration_prediction = self.duration_predictor(
        inputs, sdp_mask, torch.log(attn_hard_dur.float() + 1).unsqueeze(1)
    )
    duration_prediction = duration_prediction / torch.sum(sdp_mask)
else:
    duration_prediction = self.duration_predictor(inputs, sdp_mask, reverse=True, noise_scale=0.333)
    duration_prediction = duration_prediction.squeeze(1)

duration_rounded = torch.clamp(
    torch.round(torch.exp(duration_prediction) - 1) * d_control,
    min=1,
)

# loss
duration_loss = torch.sum(duration_prediction.float())
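
One thing that may be worth checking is whether the duration encoding matches in both directions. The sketch below is an illustration under stated assumptions, not a diagnosis. It assumes the predictor is the unmodified VITS `StochasticDurationPredictor`, which, as far as I can tell from the VITS code, is given raw durations during training, applies a log internally, and returns log-durations in reverse mode. `reverse_output` is a hypothetical, noise-free stand-in for such a predictor, and `compare` is a helper made up for this example.

```python
import math

def reverse_output(training_target: float) -> float:
    """Hypothetical idealized predictor (assumption, not the real module).

    Mirrors a flow that applies log() to its training target internally,
    so the reverse pass yields log(target). Sampling noise is ignored.
    """
    return math.log(training_target)

def decode_as_in_post(x: float) -> int:
    # Decoding used in the pseudocode above: round(exp(x) - 1), clamped to >= 1.
    return max(round(math.exp(x) - 1), 1)

def decode_vits_style(logw: float) -> int:
    # Decoding assumed for VITS: ceil(exp(logw)).
    # The small epsilon only guards against floating-point overshoot in this toy example.
    return math.ceil(math.exp(logw) - 1e-9)

def compare(d: int) -> tuple:
    """Return (duration decoded as in the post, duration decoded VITS-style)."""
    # As in the post: log(d + 1) is passed as the target, so the flow models log(log(d + 1)).
    post = decode_as_in_post(reverse_output(math.log(d + 1)))
    # Assumed VITS convention: the raw duration d is passed as the target.
    vits = decode_vits_style(reverse_output(float(d)))
    return post, vits

for d in [1, 3, 10, 40]:
    post, vits = compare(d)
    print(f"true={d:3d}  post-style={post}  vits-style={vits}")
```

Under these assumptions, the post-style round trip collapses most durations to a few frames, which could look like a predictor that "does not work" even while the loss falls. If the predictor has been modified to expect log-durations, this mismatch would not apply.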

LEECHOONGHO • Feb 19 '22 12:02

How is the synthesis result with the fs2 duration predictor after the same number of training steps? Also, in fs2 training the gradient from the duration predictor is passed to the encoder, whereas VITS uses x.detach() to cut off the gradient. I think this should also be taken into consideration. https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L51
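
A minimal sketch of the gradient cut-off described above, assuming PyTorch. `encoder` and `duration_head` are hypothetical stand-ins (plain linear layers), not the fs2 or VITS modules. The point is only that detaching the predictor's input keeps the duration loss from updating the encoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the text encoder and the duration predictor.
encoder = nn.Linear(8, 8)
duration_head = nn.Linear(8, 1)

def encoder_grad_norm(detach_input: bool) -> float:
    """Backprop a dummy duration loss and return the encoder's weight-gradient norm."""
    encoder.zero_grad()
    duration_head.zero_grad()
    hidden = encoder(torch.randn(4, 8))
    # VITS-style: cut the graph so the duration loss cannot reach the encoder.
    pred_in = hidden.detach() if detach_input else hidden
    loss = duration_head(pred_in).pow(2).mean()
    loss.backward()
    grad = encoder.weight.grad
    # Depending on the PyTorch version, an untouched gradient is None or all zeros.
    return 0.0 if grad is None else grad.norm().item()

print("fs2-style (no detach):", encoder_grad_norm(detach_input=False))
print("vits-style (detach):  ", encoder_grad_norm(detach_input=True))
```

With the input detached, the encoder receives no gradient from the duration loss, so switching between the two settings changes what the encoder learns, not just the predictor.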

OnceJune • Feb 21 '22 06:02

@OnceJune Thanks for your reply. The fastspeech2 duration predictor works well. An audio sample synthesized with ddp is below. https://user-images.githubusercontent.com/44384060/154802366-3e1a959f-8652-4adb-95f8-f234ceb09d87.mp4

I think that's a very good point. However, as mentioned in the paper, I am afraid that the loss arising from the SDP's noise could adversely affect the text encoder (for example, by causing mispronunciation). I'll test this and report back if the result is good.

LEECHOONGHO • Feb 21 '22 07:02

@LEECHOONGHO Hi mate, have you successfully applied SDP to fs2?

blx0102 • Nov 25 '22 02:11