saint
Implementation of Attention module in Transformer
Thank you for sharing your work; it has been helping me a lot. I have a question about your code, specifically the Attention module of the Transformer. Am I right that the Attention module should have a dropout layer after the softmax function (link)? For example, in link and link, a dropout layer is used inside the Attention module.
What you mentioned is indeed a common practice.
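For reference, here is a minimal sketch of what the question is describing: scaled dot-product attention with dropout applied to the attention weights immediately after the softmax. This is an illustrative PyTorch example, not the repo's actual code; the class name `Attention` and parameters `dim`, `heads`, and `dropout` are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Minimal multi-head self-attention with dropout on the
    attention weights right after the softmax (the step under
    discussion). Illustrative only; names are hypothetical."""

    def __init__(self, dim, heads=8, dropout=0.1):
        super().__init__()
        assert dim % heads == 0, "dim must be divisible by heads"
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.attn_drop = nn.Dropout(dropout)  # dropout after softmax
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        h = self.heads
        # Project to queries, keys, values and split into heads.
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2) for t in qkv)
        # Scaled dot-product attention scores.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Softmax, then dropout on the attention weights.
        attn = self.attn_drop(attn.softmax(dim=-1))
        # Weighted sum of values, merge heads back.
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)


# Usage sketch: dropout on the weights is active only in training mode.
x = torch.randn(2, 16, 64)
layer = Attention(dim=64, heads=8, dropout=0.1)
y = layer(x)  # shape (2, 16, 64)
```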