
Why does cosFormer not work on the Transformer-XL architecture?

Open lwaekfjlk opened this issue 3 years ago • 0 comments

When I implement cosFormer's attention inside the MultiHeadAttention of Transformer-XL and run it without the extra long-range memory, the ReLU-based variant performs worse than the ELU-based one. I suspect this is because the attention and feed-forward sub-layers are arranged differently: an XL-style transformer uses different layer-norm placement and residual connections. Why is replacing softmax(QK^T) with ReLU(Q)ReLU(K)^T not robust across different transformer architectures?
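For context, a minimal sketch (not the repo's actual implementation, and omitting cosFormer's cos-based reweighting) of the two kernel feature maps being compared, plugged into a non-causal linearized attention of the form phi(Q)(phi(K)^T V) that stands in for softmax(QK^T)V. The function and variable names here are illustrative only.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, feature_map):
    # q, k, v: (batch, heads, seq_len, head_dim)
    q, k = feature_map(q), feature_map(k)
    # Accumulate phi(k_n) v_n^T over the sequence dimension.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    # Row-wise normalizer: phi(q_n) . sum_m phi(k_m)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

relu_map = lambda x: F.relu(x)        # cosFormer-style non-negative map (cos reweighting omitted)
elu_map = lambda x: F.elu(x) + 1.0    # ELU+1 map used in linear-transformer-style attention

q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)
out_relu = linear_attention(q, k, v, relu_map)
out_elu = linear_attention(q, k, v, elu_map)
```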

lwaekfjlk · Jul 15 '22 13:07