`word_language_model`: Different masking operations in two official tutorials
Hi, thanks for the great tutorial on language modeling. A question on masking the input: Why do we mask the input in the encoder layer?
I'm aware that the mask is meant to prevent attention to future words. As per the original transformer paper, masked attention is only applied in the transformer decoder layers, not in the encoder layers. However, in this tutorial the output of the encoder layer is used directly to predict the next word, so the mask is applied to the encoder layer. In short, is a decoder layer still necessary for language modeling, or should a decoder with masked attention be added to the tutorial?
This PyTorch tutorial on language modeling does not have any masking operation. It seems like the major difference between the two tutorials is the masking.
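For reference, here is a minimal sketch (my own code, not copied from either tutorial) of the kind of causal mask I mean being applied directly in the encoder:

```python
import torch
import torch.nn as nn

def generate_square_subsequent_mask(sz):
    # Additive mask: -inf above the diagonal, so position i may only
    # attend to positions <= i (no attention to future words).
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

seq_len, batch, d_model, nhead = 10, 2, 64, 4
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

src = torch.rand(seq_len, batch, d_model)        # (S, N, E)
mask = generate_square_subsequent_mask(seq_len)  # (S, S)
out = encoder(src, mask=mask)                    # causal self-attention in the *encoder*
print(out.shape)                                 # torch.Size([10, 2, 64])
```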
Hi, I saw your issue by chance. I have studied the transformer a little and once used it for word_language_model, though not very successfully...
- Why doesn't word_language_model use a transformer decoder? I guess the answer is that we don't need it: this is not a translation task (e.g. English to French), so there is no need to make a link between source data and target data. In other words, there is no need to feed the encoder output into a decoder. The model is therefore simple, and its "decoder" is just a linear module. The task is to generate the next word for each input word; think of it like predicting the next number in a sine function given a series of sinusoidal data. (See the sketch after this post.)
- "I'm aware that the mask is meant to prevent attention to future words" -- you may be thinking of the positional encoding, not the mask in the encoder's forward. The positional encoding is used in both the encoder and the decoder; you can find it in the transformer's structure diagram.
I am new to NLP, so if the explanation above is not correct, feel free to disregard it...
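Roughly what I mean, as a minimal sketch (my own hypothetical code, not the tutorial's exact modules): an encoder with causal self-attention plus positional encoding, and a plain linear layer as the "decoder":

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding added to the embeddings
    (used in both encoder and decoder in the original paper)."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(1))    # (max_len, 1, d_model)

    def forward(self, x):                              # x: (S, N, E)
        return x + self.pe[:x.size(0)]

class TinyTransformerLM(nn.Module):
    """Encoder-only language model: the 'decoder' is just a linear layer
    projecting each position's hidden state onto the vocabulary."""
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # the "simple linear decoder"

    def forward(self, src, src_mask):
        x = self.pos_enc(self.embed(src) * math.sqrt(self.d_model))
        x = self.encoder(x, mask=src_mask)             # causal self-attention only
        return self.lm_head(x)                         # next-word logits per position

vocab_size, seq_len, batch = 1000, 12, 2
model = TinyTransformerLM(vocab_size)
src = torch.randint(0, vocab_size, (seq_len, batch))
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
print(model(src, mask).shape)  # torch.Size([12, 2, 1000])
```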