mage
About the computation of logits after the decoder
Hi @LTH14 ,
Great work! I have a question about how the logits are computed after the decoder. I see that an MlmLayer is used: the decoder output is mapped by an fc layer, the dot product between the mapped features and the word embeddings is computed, and a bias is added to give the logits.
Have you tried computing the logits directly with a single fc layer on top of the decoder output features? What is the main difference between these two types of logits, and which one do you think is better?
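To make the comparison concrete, here is a minimal NumPy sketch of the two options as I understand them. It is not the repository's code: all function names, shapes, and random weights are hypothetical, and it models only the steps described above (fc layer, dot product with the word embeddings, bias), leaving out any activation or normalization the real layer may apply.

```python
import numpy as np

def mlm_logits(h, W_fc, b_fc, E, bias):
    """Tied head as described above: fc projection, dot product with word embeddings, plus bias."""
    z = h @ W_fc + b_fc        # (n, d) -> (n, d_emb)
    return z @ E.T + bias      # (n, d_emb) x (d_emb, vocab) -> (n, vocab)

def direct_fc_logits(h, W_out, b_out):
    """Alternative: a single fc layer mapping decoder features straight to vocabulary logits."""
    return h @ W_out + b_out   # (n, d) -> (n, vocab)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d, d_emb, vocab = 4, 8, 8, 16
h = rng.normal(size=(n, d))            # decoder output features
W_fc = rng.normal(size=(d, d_emb))     # fc weight of the tied head
b_fc = rng.normal(size=d_emb)          # fc bias of the tied head
E = rng.normal(size=(vocab, d_emb))    # word embedding table (shared with the input side)
bias = rng.normal(size=vocab)          # per-token output bias

tied = mlm_logits(h, W_fc, b_fc, E, bias)

# In this purely linear sketch, the tied head equals a direct fc layer whose
# weight is constrained to the factorized form W_fc @ E.T.
W_out = W_fc @ E.T
b_out = b_fc @ E.T + bias
direct = direct_fc_logits(h, W_out, b_out)

print(tied.shape, np.allclose(tied, direct))
```

If this reading is right, then in the linear case the tied head is a special case of the direct fc head. The difference would be that its output weights are shared with the word embeddings instead of being learned as a free (d, vocab) matrix.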
Thanks.
We follow BERT for this design. I haven't tried using logits directly from an fc layer.