
Training issue with Male voice

Open akshay4malik opened this issue 5 years ago • 47 comments

I am trying to train a Flowtron model with a male voice. After training for about 270,000 steps, the generated audio is very random: not a single word is produced properly, and the model is not even learning attention. Earlier I tried the LJ Speech dataset; after about 170,000 steps of training, the audio samples were not so bad. The pronunciation was not up to the mark, but I could understand what was being said. I have the same amount of data as LJ Speech. I am attaching the attention plots here: sid0_sigma0 5_attnlayer1, sid0_sigma0 5_attnlayer0.

akshay4malik avatar Jun 29 '20 07:06 akshay4malik

Training on what language? Did you try warm-starting from the pre-trained model? Trimming silences from the beginning and end of audio files helps with learning attention.

rafaelvalle avatar Jul 01 '20 04:07 rafaelvalle

I am doing it for the Hindi language. We have trimmed silences from the beginning and end of the audio files.

akshay4malik avatar Jul 01 '20 04:07 akshay4malik

Were you able to train the same data on Tacotron before?

rafaelvalle avatar Jul 01 '20 04:07 rafaelvalle

Yes, we were getting good results on Tacotron 2 with the same data. But Flowtron offers several new features, so we thought of training on it as well.

akshay4malik avatar Jul 01 '20 04:07 akshay4malik

That's great news. Use pre-trained weights from your Tacotron model to warm-start a Flowtron with a single step of flow. Once the first step of flow has learned to attend, add the second step of flow and train the full model.

rafaelvalle avatar Jul 01 '20 04:07 rafaelvalle

OK, I will try that and post the results. But what could be the possible reason it fails when I try to train the Flowtron model directly?

akshay4malik avatar Jul 01 '20 04:07 akshay4malik

I think you're trying to learn both steps of flow at the same time. As we describe in our paper, it's easier to train Flowtron and its steps of flow progressively, for example:

  1. First train Flowtron with one step of flow until it learns to attend to the text
  2. Use this model to warm-start a Flowtron with 2 steps of flow and train the entire model
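A minimal sketch of that schedule as two config states (the checkpoint file names below are hypothetical; the keys mirror the style of the repo's config.json):

```python
# Minimal sketch of the progressive schedule, assuming config.json-style keys.
# The checkpoint file names below are hypothetical.

# Stage 1: one step of flow, warm-started from a Tacotron 2 state_dict.
stage1 = {
    "train_config": {
        "checkpoint_path": "",  # stays empty: we warm-start, we don't resume
        "warmstart_checkpoint_path": "tacotron2_statedict.pt",
        "include_layers": ["encoder", "embedding"],
    },
    "model_config": {"n_flows": 1},
}

# Stage 2: once flow 0 attends, warm-start a 2-flow model from stage 1,
# this time keeping all weights (include_layers=None includes everything).
stage2 = {
    "train_config": {
        "checkpoint_path": "",
        "warmstart_checkpoint_path": "flowtron_1flow.pt",
        "include_layers": None,
    },
    "model_config": {"n_flows": 2},
}
```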

rafaelvalle avatar Jul 01 '20 04:07 rafaelvalle

While training from the pretrained Tacotron 2 model, there are some issues. I have worked around some of them, but certain properties of the Tacotron 2 model are still causing problems. Here is the error, which I believe comes from the prenet and postnet layers in Tacotron 2 that are not present in Flowtron:

RuntimeError: Error(s) in loading state_dict for Flowtron: Unexpected key(s) in state_dict: "decoder.prenet.layers.0.linear_layer.weight", "decoder.prenet.layers.1.linear_layer.weight", "decoder.attention_rnn.weight_ih", "decoder.attention_rnn.weight_hh", "decoder.attention_rnn.bias_ih", "decoder.attention_rnn.bias_hh", "decoder.attention_layer.query_layer.linear_layer.weight", "decoder.attention_layer.memory_layer.linear_layer.weight", "decoder.attention_layer.v.linear_layer.weight", "decoder.attention_layer.location_layer.location_conv.conv.weight", "decoder.attention_layer.location_layer.location_dense.linear_layer.weight", "decoder.decoder_rnn.weight_ih", "decoder.decoder_rnn.weight_hh", "decoder.decoder_rnn.bias_ih", "decoder.decoder_rnn.bias_hh", "decoder.linear_projection.linear_layer.weight", "decoder.linear_projection.linear_layer.bias", "decoder.gate_layer.linear_layer.weight", "decoder.gate_layer.linear_layer.bias", "postnet.convolutions.0.0.conv.weight", "postnet.convolutions.0.0.conv.bias", "postnet.convolutions.0.1.weight", "postnet.convolutions.0.1.bias", "postnet.convolutions.0.1.running_mean", "postnet.convolutions.0.1.running_var", "postnet.convolutions.0.1.num_batches_tracked", "postnet.convolutions.1.0.conv.weight", "postnet.convolutions.1.0.conv.bias", "postnet.convolutions.1.1.weight", "postnet.convolutions.1.1.bias", "postnet.convolutions.1.1.running_mean", "postnet.convolutions.1.1.running_var", "postnet.convolutions.1.1.num_batches_tracked", "postnet.convolutions.2.0.conv.weight", "postnet.convolutions.2.0.conv.bias", "postnet.convolutions.2.1.weight", "postnet.convolutions.2.1.bias", "postnet.convolutions.2.1.running_mean", 
"postnet.convolutions.2.1.running_var", "postnet.convolutions.2.1.num_batches_tracked", "postnet.convolutions.3.0.conv.weight", "postnet.convolutions.3.0.conv.bias", "postnet.convolutions.3.1.weight", "postnet.convolutions.3.1.bias", "postnet.convolutions.3.1.running_mean", "postnet.convolutions.3.1.running_var", "postnet.convolutions.3.1.num_batches_tracked", "postnet.convolutions.4.0.conv.weight", "postnet.convolutions.4.0.conv.bias", "postnet.convolutions.4.1.weight", "postnet.convolutions.4.1.bias", "postnet.convolutions.4.1.running_mean", "postnet.convolutions.4.1.running_var", "postnet.convolutions.4.1.num_batches_tracked", "encoder.convolutions.0.1.running_mean", "encoder.convolutions.0.1.running_var", "encoder.convolutions.0.1.num_batches_tracked", "encoder.convolutions.1.1.running_mean", "encoder.convolutions.1.1.running_var", "encoder.convolutions.1.1.num_batches_tracked", "encoder.convolutions.2.1.running_mean", "encoder.convolutions.2.1.running_var", "encoder.convolutions.2.1.num_batches_tracked".

akshay4malik avatar Jul 01 '20 06:07 akshay4malik

These are harmless and expected given that Flowtron does not have these layers. Are you using warmstart_checkpoint_path instead of checkpoint_path?

rafaelvalle avatar Jul 01 '20 17:07 rafaelvalle

Yes, I am using warmstart_checkpoint_path, and I have changed "include_layers": ["speaker", "encoder", "embedding"] to "include_layers": ["encoder", "embedding"], with n_flows = 1. Since, as you mentioned, these errors are harmless, how can I avoid them and start training?

akshay4malik avatar Jul 01 '20 17:07 akshay4malik

If you're using warmstart_checkpoint_path, the loaded state_dict should be filtered and not have the weights you listed.

Can you send the full stack-trace? If the issue is happening here, you might have to save the Tacotron 2 weights as a state_dict instead of loading ['model'].
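A minimal sketch of that re-save, assuming the Tacotron 2 checkpoint nests the weights under a 'model' key (the helper name is hypothetical):

```python
def extract_model_state(checkpoint):
    """Return a plain name-to-weight mapping from a Tacotron 2 checkpoint.

    Hypothetical helper: NVIDIA-style checkpoints typically store the
    weights under 'model', either as an nn.Module or as a state_dict.
    """
    model = checkpoint.get("model", checkpoint)
    # An nn.Module exposes .state_dict(); a plain dict is returned as-is.
    return model.state_dict() if hasattr(model, "state_dict") else model
```

The returned mapping can then be written out with torch.save and passed as warmstart_checkpoint_path.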

rafaelvalle avatar Jul 01 '20 18:07 rafaelvalle

No, the problem in that function occurs here. That can be solved by removing this if condition, since there are no speaker embeddings in the Tacotron 2 model. The next problem occurs when loading the optimizer, here. However, I can work around it by replacing the line optimizer.load_state_dict(checkpoint_dict['optimizer']) with the following lines:

optimizer.state_dict()['param_groups'] = checkpoint_dict['optimizer']['param_groups']
optimizer.state_dict()['state'] = checkpoint_dict['optimizer']['state']

I am not sure about this solution, though. And after all this, the problem comes at this line. The error is the following:

RuntimeError: Error(s) in loading state_dict for Flowtron: Missing key(s) in state_dict: "speaker_embedding.weight", "flows.0.conv.weight", "flows.0.conv.bias", "flows.0.lstm.weight_ih_l0", "flows.0.lstm.weight_hh_l0", "flows.0.lstm.bias_ih_l0", "flows.0.lstm.bias_hh_l0", "flows.0.lstm.weight_ih_l1", "flows.0.lstm.weight_hh_l1", "flows.0.lstm.bias_ih_l1", "flows.0.lstm.bias_hh_l1", "flows.0.attention_lstm.weight_ih_l0", "flows.0.attention_lstm.weight_hh_l0", "flows.0.attention_lstm.bias_ih_l0", "flows.0.attention_lstm.bias_hh_l0", "flows.0.attention_layer.query.linear_layer.weight", "flows.0.attention_layer.key.linear_layer.weight", "flows.0.attention_layer.value.linear_layer.weight", "flows.0.attention_layer.v.linear_layer.weight", "flows.0.dense_layer.layers.0.linear_layer.weight", "flows.0.dense_layer.layers.0.linear_layer.bias", "flows.0.dense_layer.layers.1.linear_layer.weight", "flows.0.dense_layer.layers.1.linear_layer.bias", "flows.0.gate_layer.linear_layer.weight", "flows.0.gate_layer.linear_layer.bias". 
Unexpected key(s) in state_dict: "decoder.prenet.layers.0.linear_layer.weight", "decoder.prenet.layers.1.linear_layer.weight", "decoder.attention_rnn.weight_ih", "decoder.attention_rnn.weight_hh", "decoder.attention_rnn.bias_ih", "decoder.attention_rnn.bias_hh", "decoder.attention_layer.query_layer.linear_layer.weight", "decoder.attention_layer.memory_layer.linear_layer.weight", "decoder.attention_layer.v.linear_layer.weight", "decoder.attention_layer.location_layer.location_conv.conv.weight", "decoder.attention_layer.location_layer.location_dense.linear_layer.weight", "decoder.decoder_rnn.weight_ih", "decoder.decoder_rnn.weight_hh", "decoder.decoder_rnn.bias_ih", "decoder.decoder_rnn.bias_hh", "decoder.linear_projection.linear_layer.weight", "decoder.linear_projection.linear_layer.bias", "decoder.gate_layer.linear_layer.weight", "decoder.gate_layer.linear_layer.bias", "postnet.convolutions.0.0.conv.weight", "postnet.convolutions.0.0.conv.bias", "postnet.convolutions.0.1.weight", "postnet.convolutions.0.1.bias", "postnet.convolutions.0.1.running_mean", "postnet.convolutions.0.1.running_var", "postnet.convolutions.0.1.num_batches_tracked", "postnet.convolutions.1.0.conv.weight", "postnet.convolutions.1.0.conv.bias", "postnet.convolutions.1.1.weight", "postnet.convolutions.1.1.bias", "postnet.convolutions.1.1.running_mean", "postnet.convolutions.1.1.running_var", "postnet.convolutions.1.1.num_batches_tracked", "postnet.convolutions.2.0.conv.weight", "postnet.convolutions.2.0.conv.bias", "postnet.convolutions.2.1.weight", "postnet.convolutions.2.1.bias", "postnet.convolutions.2.1.running_mean", "postnet.convolutions.2.1.running_var", "postnet.convolutions.2.1.num_batches_tracked", "postnet.convolutions.3.0.conv.weight", "postnet.convolutions.3.0.conv.bias", "postnet.convolutions.3.1.weight", "postnet.convolutions.3.1.bias", "postnet.convolutions.3.1.running_mean", "postnet.convolutions.3.1.running_var", "postnet.convolutions.3.1.num_batches_tracked", 
"postnet.convolutions.4.0.conv.weight", "postnet.convolutions.4.0.conv.bias", "postnet.convolutions.4.1.weight", "postnet.convolutions.4.1.bias", "postnet.convolutions.4.1.running_mean", "postnet.convolutions.4.1.running_var", "postnet.convolutions.4.1.num_batches_tracked", "encoder.convolutions.0.1.running_mean", "encoder.convolutions.0.1.running_var", "encoder.convolutions.0.1.num_batches_tracked", "encoder.convolutions.1.1.running_mean", "encoder.convolutions.1.1.running_var", "encoder.convolutions.1.1.num_batches_tracked", "encoder.convolutions.2.1.running_mean", "encoder.convolutions.2.1.running_var", "encoder.convolutions.2.1.num_batches_tracked".

akshay4malik avatar Jul 02 '20 04:07 akshay4malik

You should pass only warmstart_checkpoint_path, not checkpoint_path. If you pass checkpoint_path, you will execute the wrong method, load_checkpoint. As you said, you'll need to comment out the speaker embedding check.
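The dispatch described above can be sketched as follows (a rough approximation of the logic in train.py, not the exact code):

```python
def resolve_start_mode(checkpoint_path: str, warmstart_path: str) -> str:
    """Approximate sketch of the start-up dispatch: a non-empty
    checkpoint_path triggers the strict load_checkpoint (model plus
    optimizer state), so it must stay empty when warm-starting."""
    if checkpoint_path != "":
        return "load_checkpoint"  # strict load; Tacotron 2 weights will not fit
    if warmstart_path != "":
        return "warmstart"        # filtered, model-weights-only transfer
    return "from_scratch"
```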

rafaelvalle avatar Jul 02 '20 04:07 rafaelvalle

I have not added checkpoint_path; below is the config JSON:

    "train_config": {
        "output_directory": "outdir",
        "epochs": 10000000,
        "learning_rate": 1e-4,
        "weight_decay": 1e-6,
        "sigma": 1.0,
        "iters_per_checkpoint": 5000,
        "batch_size": 1,
        "seed": 1234,
        "checkpoint_path": "",
        "ignore_layers": [],
        "include_layers": ["encoder", "embedding"],
        "warmstart_checkpoint_path": "warmStartOnTacotron/gpu5_checkpoint_176000",
        "with_tensorboard": true,
        "fp16_run": false
    }

akshay4malik avatar Jul 02 '20 04:07 akshay4malik

You mentioned issues when loading the optimizer. The optimizer is only loaded if this condition is satisfied, which then executes load_checkpoint.

rafaelvalle avatar Jul 02 '20 04:07 rafaelvalle

The same applies to the error you mentioned seeing here. This function is only executed if you pass checkpoint_path.

rafaelvalle avatar Jul 02 '20 04:07 rafaelvalle

I got it. I am sorry, I made a little mistake while giving the "train" command. When you pointed it out, I looked at the functions. The training has started, but here is something I would like to know: should I start training with n_flow = 1 for around 100,000 iterations and then start again with a warm start with n_flow = 2?

akshay4malik avatar Jul 02 '20 04:07 akshay4malik

Does it work when you comment out this and pass a model to warmstart_checkpoint_path?

rafaelvalle avatar Jul 02 '20 04:07 rafaelvalle

> Does it work when you comment out this and pass a model to warmstart_checkpoint_path?

The error occurred because I was passing the checkpoint path when running the code. I did not realize it until you mentioned that these functions would not be called.

akshay4malik avatar Jul 02 '20 04:07 akshay4malik

Great! Let us know once you're able to train with the male Hindi voice.

rafaelvalle avatar Jul 02 '20 04:07 rafaelvalle

> Great! Let us know once you're able to train with the male Hindi voice.

Sure, I will definitely let you know. But here is something I would like to know: should I start training with n_flow = 1 for around 100,000 iterations and then start again with a warm start with n_flow = 2?

akshay4malik avatar Jul 02 '20 04:07 akshay4malik

Yes! Train with n_flow=1 until attention starts looking good, then use the n_flow=1 model to warm-start a model with n_flow=2, including all weights from n_flow=1. If include_layers=None, it will include all weights, as you can see here.
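That filtering can be sketched like this (an approximation of the repo's warm-start behaviour, not the exact code):

```python
def filter_state_dict(state_dict, include_layers=None):
    """Keep only weights whose name matches an entry of include_layers;
    include_layers=None keeps every weight, which is what you want when
    warm-starting the n_flow=2 model from a trained n_flow=1 model."""
    if include_layers is None:
        return dict(state_dict)
    return {name: w for name, w in state_dict.items()
            if any(layer in name for layer in include_layers)}
```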

rafaelvalle avatar Jul 02 '20 05:07 rafaelvalle

> Yes! Train with n_flow=1 until attention starts looking good, then use the n_flow=1 model to warm-start a model with n_flow=2, including all weights from n_flow=1. If include_layers=None, it will include all weights, as you can see here.

Sure! Thanks a lot for all the help.

akshay4malik avatar Jul 02 '20 05:07 akshay4malik

@rafaelvalle I have one more query: how important is CMUDict when training a Flowtron model? It is not available for Hindi, so I have bypassed it. How will that affect the results? The Tacotron 2 model did not face any issue when we bypassed CMUDict.

akshay4malik avatar Jul 03 '20 06:07 akshay4malik

It should not be an issue, given that in Hindi there's a one-to-one correspondence between graphemes and phonemes.

rafaelvalle avatar Jul 03 '20 15:07 rafaelvalle

@rafaelvalle I have a similar setup for Hindi. I have trained for about 230k steps, but the attention is not aligning. I have a trained Tacotron model for Hindi that works very well, and I have used its weights for warm-starting Flowtron. @akshay4malik did you get good results?

akashicMarga avatar Jul 25 '20 05:07 akashicMarga

@singhaki Yes, I got attention, and the generated speech is fair as well. Instead of warm-starting from the Tacotron 2 model, try the LJ Speech pretrained model, which is available publicly. And you will have to wait a little longer than 230k steps: with flow 1, you will start getting attention around 0.5M steps.

akshay4malik avatar Jul 25 '20 11:07 akshay4malik

@akshay4malik were you able to train the model with 2 steps of flow by warm-starting from the model with 1 step of flow you trained on your data?

rafaelvalle avatar Jul 25 '20 17:07 rafaelvalle

@rafaelvalle Yes, the training for step 2 is going on. In TensorBoard I am getting a good attention plot for the second step as well, but the generated audio is not good yet. I hope it will improve with further training.

akshay4malik avatar Jul 25 '20 18:07 akshay4malik

(Attachments: losses_500k, attention_500k)

My loss curves and attention look like this after training for 500k steps. Should I decrease the learning rate? Any suggestions, @akshay4malik @rafaelvalle? The audio generated by the model is gibberish.

akashicMarga avatar Jul 27 '20 06:07 akashicMarga