Question about prepare alignments
Hi,
In the prepare_fastspeech.ipynb file, regarding these lines:
F = torch.mean(torch.max(alignments, dim=-1)[0], dim=-1)
r, c = torch.argmax(F).item()//4, torch.argmax(F).item()%4
location = torch.max(alignments[r,c], dim=1)[1]
My understanding is: in the first line, the tensor shape changes from (layer_num, target_length, source_length) to (layer_num, target_length), and then to (layer_num). But I don't understand what the "4" means, or why the layer number is used to calculate the location.
If there is a problem with my understanding, thanks for pointing it out.
Hello, @chynphh
4 is the number of heads used in the multi-head attention. If you edit the return value of the multi-head attention module in PyTorch, you can get the attention weights with shape (layer_num, head_num, target_length, source_length).
Consequently, r and c are the layer index and the head index of the attention head selected by the argmax. Hope this comment is helpful to you.
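For illustration, here is a minimal, self-contained sketch of the same selection logic on a random toy tensor (the layer count and sequence lengths are made up for the example; only the head count of 4 matches the notebook):

import torch

# Toy attention stack with the shape described above:
# (layer_num, head_num, target_length, source_length)
n_layers, n_heads, target_len, source_len = 6, 4, 10, 8
alignments = torch.softmax(torch.randn(n_layers, n_heads, target_len, source_len), dim=-1)

# Per (layer, head) focus score: the largest source weight at each target
# step, averaged over target steps -> shape (layer_num, head_num)
F = torch.mean(torch.max(alignments, dim=-1)[0], dim=-1)

# argmax on a 2-D tensor returns a flattened index, so integer division
# and modulo by the head count recover the (layer, head) pair.
flat = torch.argmax(F).item()
r, c = flat // n_heads, flat % n_heads  # r = layer index, c = head index

# For the selected head, the most-attended source position at each
# target step -> shape (target_length,)
location = torch.max(alignments[r, c], dim=1)[1]

With 4 heads per layer, dividing the flattened index by 4 gives the layer and the remainder gives the head, which is exactly what the //4 and %4 in the notebook do.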
Sincerely,