Multispeaker
First, I want to say that I love this repo: it performs excellently on many of my small (down to 80 seconds) and noisy datasets, with good transfer learning. When is multispeaker support coming? Or is it technically implemented but just not used yet?
@ZDisket I haven't tried it yet since I have been busy with some other research these days. I am curious about your results on small datasets. How good is the audio quality?
@ming024 Here's an audio sample. Admittedly, I was exaggerating a bit when I said it performs excellently, but it's good compared to Tacotron2, which yielded a completely unusable model. I always upsample my audio to 48 kHz, then boost the treble and reduce the bass to improve audio quality.
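For reference, a minimal sketch of that kind of preprocessing (filenames and filter settings are hypothetical; the exact tool and EQ values aren't specified here):

```python
import librosa
import soundfile as sf
from scipy.signal import butter, sosfilt

# Load at the file's native rate, then upsample to 48 kHz.
y, sr = librosa.load("clip.wav", sr=None)
y = librosa.resample(y, orig_sr=sr, target_sr=48000)

# Reduce bass with a gentle high-pass; a treble boost would additionally
# apply a high-shelf filter, omitted in this sketch.
sos = butter(2, 80, btype="highpass", fs=48000, output="sos")
y = sosfilt(sos, y)

sf.write("clip_48k.wav", y, 48000)
```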
@ZDisket I think autoregressive models require more data to learn a good alignment between text sequences and spectrograms. It would be an interesting experiment to compare autoregressive and non-autoregressive models in a low-resource setting.
@ming024 There's something about your implementation specifically that makes it perform excellently on my hard and small datasets. I tried another implementation, and its models were also unusable.
@ZDisket Maybe it is because I use phoneme sequences instead of character sequences? I think most TTS models available online use character sequences as inputs. But handling punctuation with MFA-aligned phoneme sequences is still a problem in my implementation.
@ming024 No, the other implementation also uses durations from MFA-extracted phonemes, which I implemented using yours as a reference. The difference is that in my equivalent of https://github.com/ming024/FastSpeech2/blob/e0a28e04db6631a4f9303a898b690ebf1ebea7fe/utils.py#L40 I used round() instead of int(), because it led to greater stability, especially in longer prompts.
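For concreteness, a minimal sketch of the two variants being compared (the interval values are made up; in the actual code the (start, end) times per phone come from the MFA TextGrid):

```python
sampling_rate = 22050
hop_length = 256

# Hypothetical MFA (start, end) times per phone, in seconds.
intervals = [(0.000, 0.137), (0.137, 0.215), (0.215, 0.389)]

# int() truncates each phone's length independently...
dur_int = [int((e - s) * sampling_rate / hop_length) for s, e in intervals]
# ...while round() snaps each one to the nearest frame.
dur_round = [round((e - s) * sampling_rate / hop_length) for s, e in intervals]

# With int(), every phone can lose up to one frame, so the summed
# duration of a long utterance can drift well below the mel length.
print(dur_int, dur_round)  # [11, 6, 14] vs [12, 7, 15]
```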
@ZDisket I don't think the error of int() will propagate. It would if I used

`durations.append(int((e - s) * hp.sampling_rate / hp.hop_length))`

But because the end frame of the previous phone is exactly the start frame of the current phone, I think there is very little difference between int() and round().
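Sketching that point, reusing the variables from the sketch above: if each duration is the difference of truncated frame *boundaries*, the per-phone truncation errors telescope away:

```python
def frame(t):
    # frame index of a boundary at time t (seconds)
    return int(t * sampling_rate / hop_length)

# Because each phone's end time is the next phone's start time, the sum
# telescopes: sum(durations) == frame(last_end) - frame(0), so the total
# is off by at most one frame no matter how many phones there are.
durations = [frame(e) - frame(s) for s, e in intervals]
```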
@ming024 In my MFA implementation I also have some code that corrects mismatches between the summed durations and the mel lengths. The durations calculated via int() are off by about 20 to 50 frames per utterance, while the round() ones are only off by -5 to 3. I was also skeptical until I tested a model trained with round() durations. The idea is that, for example, 3.8 becomes 4, not 3.
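A mismatch correction of the kind described might look like this (a hypothetical helper, not code from either repo):

```python
def fix_length_mismatch(durations, n_mel_frames):
    # Pad or trim the last phone so the summed durations
    # exactly match the number of mel-spectrogram frames.
    diff = n_mel_frames - sum(durations)
    durations[-1] = max(durations[-1] + diff, 0)
    return durations
```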
@ZDisket Yeah, I understand your point now. But where does the 20-50 figure come from? One frame is 256/22050 = 0.0116 seconds, so 50 frames is more than half a second. Or do you mean 20~50 sample points in the raw waveform?
@ming024 I really do mean frames, and as for why it happens, I don't know; I didn't really explore that repo's preprocessing step.
@ZDisket Wow, that is a big difference. Do you have an example? I have to check out where this error comes from...
@ming024 I have one: https://github.com/TensorSpeech/TensorflowTTS/issues/107#issuecomment-656447235
> First, I want to say that I love this repo: it performs excellently on many of my small (down to 80 seconds) and noisy datasets, with good transfer learning.
@ZDisket Can you please explain how you did this transfer learning? I am working on the same experiment: I am trying to adapt the pre-trained model to a small amount of data.