cosFormer
Why does cosFormer not work on an XL-based Transformer architecture?
When I implement cosFormer in the MultiHeadAttention of Transformer-XL and run it without the extra long-range memory, the ReLU-based attention performs worse than ELU. I suspect this is because the attention and feed-forward blocks differ: the XL-style Transformer uses a different layer-norm placement and residual connections. Why is this ReLU(Q)ReLU(K)^T replacement for softmax not robust across different Transformer architectures?
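To be concrete, here is roughly what I mean by the replacement, as a simplified, non-causal sketch in PyTorch (my own illustration, not the repo's exact code, and without cosFormer's cos re-weighting); the tensor names and shapes are just assumptions:

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def relu_linear_attention(q, k, v, eps=1e-6):
    # Replace softmax(QK^T) with ReLU(Q) ReLU(K)^T, then reorder the
    # matmuls so the cost is linear in sequence length.
    q, k = F.relu(q), F.relu(k)
    kv = k.transpose(-2, -1) @ v  # (batch, heads, head_dim, head_dim)
    # Row-wise normalizer: sum_j ReLU(q_i) . ReLU(k_j)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (q @ kv) / z

# Quick shape check
q = k = v = torch.randn(2, 4, 128, 64)
print(relu_linear_attention(q, k, v).shape)  # torch.Size([2, 4, 128, 64])
```

This drop-in works fine for me in a vanilla post-LN Transformer, but degrades in the XL-style block, which is what prompted the question.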