
Bad attention weights

Open · dmitrii-obukhov opened this issue on Jul 09 '20 · 23 comments

Hello

I am trying to train Flowtron on LJSpeech. Unfortunately, after 24 hours of training the attention weights are still bad.

Server configuration: 4 instances with 8x V100 GPUs.

[attention weight plots]

Do you have any ideas?

dmitrii-obukhov avatar Jul 09 '20 13:07 dmitrii-obukhov

Yes, train it by progressively adding steps of flow as the model learns to attend on each step of flow. Start with 1 step of flow, train it until it learns attention, use this model to warm-start a model with 2 steps of flow, and so on...
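
A minimal sketch of that schedule, assuming this repo's `-p` parameter-override syntax and the `train_config.warmstart_checkpoint_path` config key (the checkpoint name below is illustrative; check the README for the exact keys):

```python
import subprocess

# Stage 1: a single step of flow, trained until attention becomes diagonal.
subprocess.run(
    ["python", "train.py", "-c", "config.json",
     "-p", "model_config.n_flows=1"],
    check=True,
)

# Stage 2: two steps of flow, warm-started from the stage-1 checkpoint.
subprocess.run(
    ["python", "train.py", "-c", "config.json",
     "-p", "model_config.n_flows=2",
     "train_config.warmstart_checkpoint_path=outdir/model_200000"],
    check=True,
)
```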

rafaelvalle avatar Jul 09 '20 13:07 rafaelvalle

@rafaelvalle Thanks for your reply

I ran a new training run with model_config.n_flows=1, but after 16 hours the attention weights still look bad:

[attention weight plot]

In one of the threads I read that good alignment is produced in less than 24 hours.

So, what could be wrong?

dmitrii-obukhov avatar Jul 10 '20 07:07 dmitrii-obukhov

Can you share your tensorboard plots?

rafaelvalle avatar Jul 10 '20 14:07 rafaelvalle

Yes

[tensorboard plots]

dmitrii-obukhov avatar Jul 10 '20 15:07 dmitrii-obukhov

Does it have good attention around 60k iterations?

rafaelvalle avatar Jul 10 '20 16:07 rafaelvalle

No. The attention looks the same at all iterations.

dmitrii-obukhov avatar Jul 10 '20 16:07 dmitrii-obukhov

Make sure you trim silences from the beginning and end of your audio files

rafaelvalle avatar Jul 10 '20 16:07 rafaelvalle

I use the LJSpeech dataset for training. Any instructions on how to trim the files?

Could the problem be that I use distributed training?

Also, I set the flag fp16_run=true

dmitrii-obukhov avatar Jul 10 '20 17:07 dmitrii-obukhov

Make sure you trim silences from the beginning and end of your audio files

Should there be no silence at all at the beginning and end, or should there be at least, say, 0.1 seconds of silence?

adrianastan avatar Jul 13 '20 10:07 adrianastan

I use the LJSpeech dataset for training. Any instructions on how to trim the files?

The simplest way would be to use librosa.effects.trim()

adrianastan avatar Jul 13 '20 10:07 adrianastan

There should be no silence at all at the beginning and end of each audio file. sox and librosa.effects.trim() can be used for this.
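
For example, a minimal trimming pass over LJSpeech (the path and top_db value are illustrative; a suitable top_db is discussed later in this thread):

```python
import glob

import librosa
import soundfile as sf

# Rewrite each clip with leading/trailing silence removed.
for path in glob.glob("LJSpeech-1.1/wavs/*.wav"):
    audio, sr = librosa.load(path, sr=None)  # sr=None keeps the native rate
    trimmed, _ = librosa.effects.trim(audio, top_db=30)
    sf.write(path, trimmed, sr)
```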

rafaelvalle avatar Jul 13 '20 16:07 rafaelvalle

I have a similar problem.

I use the LJSpeech dataset for training. Any instructions on how to trim the files?

Could the problem be that I use distributed training?

Also, I set the flag fp16_run=true

Have you solved this problem? Also, I tried to predict an LPC spectrogram instead of a mel spectrogram, but I always get a picture like this. Does anybody know what the problem is?

[attention weight plot]

kurbobo avatar Jul 17 '20 10:07 kurbobo

The problem remains unresolved. I tried trimming silences from the beginning and end of the audio files with librosa.effects.trim(), but the picture stays the same.

dmitrii-obukhov avatar Jul 18 '20 18:07 dmitrii-obukhov

@kurbobo does the attention map always look like that? You might have to change the mask from byte to bool: https://github.com/NVIDIA/flowtron/blob/master/flowtron.py#L33
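
Roughly what that change looks like (a sketch of the mask helper at that line; the exact body in flowtron.py may differ slightly):

```python
import torch

def get_mask_from_lengths(lengths):
    # masked_fill in newer PyTorch versions requires a bool mask,
    # so build the mask with .bool() instead of .byte().
    max_len = torch.max(lengths).item()
    ids = torch.arange(0, max_len, device=lengths.device)
    mask = (ids < lengths.unsqueeze(1)).bool()  # was .byte()
    return mask
```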

rafaelvalle avatar Jul 29 '20 18:07 rafaelvalle

@adrianastan There should be no silence at the beginning or at the end of an audio file.

rafaelvalle avatar Jul 29 '20 18:07 rafaelvalle

@rafaelvalle Can you clarify what "no silence" means in practice? If I use librosa.effects.trim(), what should top_db be set to? For my dataset, if I set top_db to 20, some speech also gets cut off; setting it a little higher, some audio files still seem to have silence at the beginning.

zjFFFFFF avatar Aug 04 '20 05:08 zjFFFFFF

@kurbobo The problem was solved when I initialized the encoder and embedding layers from a pretrained model.
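
A minimal sketch of that kind of warm start (this assumes the checkpoint stores the full model under a "model" key and that the module names start with "embedding." and "encoder."; verify both against your checkpoint):

```python
import torch

# 'model' is a freshly constructed Flowtron instance.
pretrained = torch.load("flowtron_ljs.pt", map_location="cpu")["model"].state_dict()
state = model.state_dict()
for name, weight in pretrained.items():
    if name.startswith(("embedding.", "encoder.")) and name in state:
        state[name] = weight  # copy only embedding and encoder weights
model.load_state_dict(state)
```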

@zjFFFFFF In my case top_db = 30 works well enough

dmitrii-obukhov avatar Aug 11 '20 06:08 dmitrii-obukhov

@dLeos

In fact, I got the same plot as you (training from scratch), but the validation loss does not seem to reflect the quality of the results. Between iterations 800,000 and 950,000 the model can generate acceptable audio (at iteration 1,000,000 I can't get acceptable results), so you can try different checkpoints one by one.
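
For example, to enumerate the checkpoints in that range (this assumes checkpoints are saved as model_<iteration> in the output directory; adjust the path and naming to your setup):

```python
import glob
import re

# Collect checkpoints saved between 800k and 950k iterations.
ckpts = []
for path in glob.glob("outdir/model_*"):
    match = re.search(r"model_(\d+)$", path)
    if match and 800_000 <= int(match.group(1)) <= 950_000:
        ckpts.append(path)

for path in sorted(ckpts):
    print(path)  # run inference with each and compare the audio by ear
```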

zjFFFFFF avatar Aug 13 '20 02:08 zjFFFFFF

@kurbobo does the attention map always look like that? You might have to change the mask from byte to bool: https://github.com/NVIDIA/flowtron/blob/master/flowtron.py#L33

@rafaelvalle No, it does not always appear, and I had already fixed the byte/bool problem before training, but the problem still happens sometimes. I have one more question: am I right that Flowtron in this repo converts every sentence to an ARPAbet transcription and then trains to map the sequence of ARPAbet symbols to a sequence of spectrogram frames?

kurbobo avatar Aug 13 '20 13:08 kurbobo

@rafaelvalle

kurbobo avatar Sep 03 '20 09:09 kurbobo

@kurbobo @rafaelvalle I tried mels: I trained with n_flows=1 first and then used that model to warm-start an n_flows=2 model. Both alignments are right, and the synthesized wavs sound good. But with the LPC parameters used by the LPCNet vocoder, everything seems fine when n_flows=1 (the loss is good and the alignment is right); however, when I train n_flows=2 warm-started from the trained n_flows=1 model, the alignment fails and the loss just oscillates without descending.

Liujingxiu23 avatar Sep 18 '20 01:09 Liujingxiu23

@Liujingxiu23 please share the training and validation losses and attention plots for the 1-step and 2-step flow models.

rafaelvalle avatar Sep 18 '20 04:09 rafaelvalle

Did you warm-start the 2-flow model from a 1-flow checkpoint around 200k iterations?

rafaelvalle avatar Sep 18 '20 20:09 rafaelvalle