Better explanation of "I failed to train the fastspeech"
Could you elaborate a little more and maybe propose a solution to the problem you raised?
(2020/02/10) I was able to finish this implementation by completing the stop token prediction and removing the concatenation of the inputs and outputs of the multi-head attention. However, the alignments of this implementation are less diagonal, so it cannot generate proper alignments for FastSpeech. As a result, I failed to train FastSpeech with this implementation :(
According to the authors of FastSpeech, using proper alignments is important for training.
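For context, this is roughly how duration targets are extracted from a teacher's encoder-decoder attention in the FastSpeech paper: each mel frame is assigned to the encoder position it attends to most, and the per-phoneme frame counts become the duration targets. The sketch below is illustrative; the function name and tensor shapes are my assumptions, not code from this repo.

```python
import torch

def extract_durations(attn, mel_len, text_len):
    """Extract per-phoneme durations from a (mel_len, text_len) attention map.

    Each decoder frame is assigned to the encoder position it attends to most;
    a phoneme's duration is the number of frames assigned to it. If the
    alignment is not diagonal, these counts become noisy and the FastSpeech
    duration predictor gets bad targets.
    """
    a = attn[:mel_len, :text_len]
    assigned = a.argmax(dim=-1)                                # encoder index per frame
    durations = torch.bincount(assigned, minlength=text_len)   # frames per phoneme
    return durations
```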
When I first implemented Transformer-TTS, I could not implement it exactly as described in the paper, so I finished it by concatenating the input and output of the multi-head self-attention.
I assume that, thanks to this concatenation, the encoder-decoder alignments were more diagonal, so I could use around 6,000 of the 13,100 data instances.
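Roughly, the concatenation variant looks like the sketch below: instead of the standard residual connection `x + attention(x)`, the block concatenates the input with the attention output and projects back to the model dimension. This is a reconstruction of the idea, not the exact code from this repo, and the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ConcatSelfAttention(nn.Module):
    """Self-attention block that concatenates its input with the attention
    output and projects back to d_model, instead of the usual residual add.
    """
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)  # maps [x ; attn(x)] back to d_model
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        out = self.proj(torch.cat([x, attn_out], dim=-1))  # concat instead of x + attn_out
        return self.norm(out)
```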
However, after I corrected my implementation to follow the original Transformer-TTS almost exactly, I could only use about 1,000 data instances for FastSpeech training, so the audio quality is much worse than before.
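One way to quantify how many utterances are "usable" is the focus rate from the FastSpeech paper: the mean of the per-frame maximum attention weight. Sharp, near-diagonal alignments score close to 1, while blurry ones score much lower. The sketch below shows this kind of filtering; the 0.5 threshold is only an illustrative value, not the one used in this repo.

```python
import torch

def focus_rate(attn, mel_len, text_len):
    """Mean of the per-frame maximum attention weight ("focus rate")."""
    a = attn[:mel_len, :text_len]
    return a.max(dim=-1).values.mean().item()

def filter_utterances(alignments, threshold=0.5):
    """Keep only utterances whose teacher alignment is sharp enough to give
    reliable duration targets. `alignments` is a list of
    (attn, mel_len, text_len) tuples; the threshold is an arbitrary example.
    """
    return [i for i, (attn, mel_len, text_len) in enumerate(alignments)
            if focus_rate(attn, mel_len, text_len) >= threshold]
```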