Multispeaker
First, I want to say that I love this repo: it performs excellently on many of my small (down to 80 seconds) and noisy datasets, with good transfer learning. When is multispeaker support coming? Or is it technically implemented but just not used yet?
@ZDisket I haven't tried it yet since I have been busy with some other research these days. I am curious about your results on small datasets. How good is the audio quality?
@ming024 Here's an audio sample. Admittedly, I was exaggerating a bit when I said it performs excellently, but it's good compared to Tacotron2, which yielded a completely unusable model. I always upsample my audio to 48 kHz, then boost the treble and reduce the bass to improve audio quality.
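For reference, a minimal sketch of that kind of preprocessing (filenames and filter settings are hypothetical; the exact tool and EQ values aren't specified here):

```python
import librosa
import soundfile as sf
from scipy.signal import butter, sosfilt

# Load at the file's native rate, then upsample to 48 kHz.
y, sr = librosa.load("clip.wav", sr=None)
y = librosa.resample(y, orig_sr=sr, target_sr=48000)

# Reduce bass with a gentle high-pass; a treble boost would additionally
# apply a high-shelf filter, omitted in this sketch.
sos = butter(2, 80, btype="highpass", fs=48000, output="sos")
y = sosfilt(sos, y)

sf.write("clip_48k.wav", y, 48000)
```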
@ZDisket I think autoregressive models require more data to learn a good alignment between text sequences and spectrograms. It would be an interesting experiment to compare autoregressive and non-autoregressive models in a low-resource setting.
@ming024 There's something about your implementation specifically that makes it perform excellently on my hard and small datasets. I tried another implementation, and its models were also unusable.
@ZDisket Maybe it is because I use phoneme sequences instead of character sequences? I think most TTS models available online use character sequences as inputs. But handling punctuation with MFA-aligned phoneme sequences is still a problem in my implementation.
@ming024 No, the other implementation also uses durations from MFA-extracted phonemes, which I implemented using yours as a reference. The difference is that in my equivalent of https://github.com/ming024/FastSpeech2/blob/e0a28e04db6631a4f9303a898b690ebf1ebea7fe/utils.py#L40 I used round() instead of int(), because it led to greater stability, especially in longer prompts.
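For concreteness, a minimal sketch of the two variants being compared (the interval values are made up; in the actual code the (start, end) times per phone come from the MFA TextGrid):

```python
sampling_rate = 22050
hop_length = 256

# Hypothetical MFA (start, end) times per phone, in seconds.
intervals = [(0.000, 0.137), (0.137, 0.215), (0.215, 0.389)]

# int() truncates each phone's length independently...
dur_int = [int((e - s) * sampling_rate / hop_length) for s, e in intervals]
# ...while round() snaps each one to the nearest frame.
dur_round = [round((e - s) * sampling_rate / hop_length) for s, e in intervals]

# With int(), every phone can lose up to one frame, so the summed
# duration of a long utterance can drift well below the mel length.
print(dur_int, dur_round)  # [11, 6, 14] vs [12, 7, 15]
```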
@ZDisket I don't think the error of int() will propagate. It would if I used

`durations.append(int((e - s) * hp.sampling_rate / hp.hop_length))`

But because the end frame of the previous phone is exactly the start frame of the current phone, I think there is very little difference between int() and round().
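Sketching that point, reusing the variables from the sketch above: if each duration is the difference of truncated frame *boundaries*, the per-phone truncation errors telescope away:

```python
def frame(t):
    # frame index of a boundary at time t (seconds)
    return int(t * sampling_rate / hop_length)

# Because each phone's end time is the next phone's start time, the sum
# telescopes: sum(durations) == frame(last_end) - frame(0), so the total
# is off by at most one frame no matter how many phones there are.
durations = [frame(e) - frame(s) for s, e in intervals]
```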
@ming024 In my MFA implementation I also have some code that corrects mismatches between the summed durations and the mel lengths. The durations calculated via int() are off by about 20 to 50 frames per utterance, while the round() ones are only off by -5 to 3. I was also skeptical until I tested a model trained with round() durations. The idea is that, for example, 3.8 becomes 4, not 3.
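A mismatch correction of the kind described might look like this (a hypothetical helper, not code from either repo):

```python
def fix_length_mismatch(durations, n_mel_frames):
    # Pad or trim the last phone so the summed durations
    # exactly match the number of mel-spectrogram frames.
    diff = n_mel_frames - sum(durations)
    durations[-1] = max(durations[-1] + diff, 0)
    return durations
```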
@ZDisket Yeah, I understand your point now. But where does the 20-50 figure come from? One frame is 256/22050 = 0.0116 seconds, so 50 frames is more than half a second. Or do you mean 20~50 sample points in the raw waveform?
@ming024 I really do mean frames, and as for why it happens, I don't know; I didn't really explore that repo's preprocessing step.
@ZDisket Wow, that is a big difference. Do you have an example? I have to check out where this error comes from...
@ming024 I have one: https://github.com/TensorSpeech/TensorflowTTS/issues/107#issuecomment-656447235
> First, I want to say that I love this repo: it performs excellently on many of my small (down to 80 seconds) and noisy datasets, with good transfer learning.
@ZDisket Can you please explain how you did this transfer learning? I am working on the same experiment: I am trying to adapt the pre-trained model to a small amount of data.