Doraemonzzz
> We are examining non-NLP applications of the cosformer self-attention, and would need to use attention masking for the padded tokens in the batch. Is there a way to incorporate...
When using the forward() function, there is no direct way to apply an attention mask, since we never compute the attention matrix explicitly. If you need an attention mask, we suggest you use left_product,...
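For illustration, here is a minimal sketch of padding-mask support in the left-product (quadratic) form of linear attention. The function name, mask convention, and shapes are my own and not the cosFormer API; the point is simply that the n × n score matrix exists in this path, so it can be masked before normalization:

```python
import torch

def left_product_linear_attention(q, k, v, key_padding_mask=None):
    # q, k: (batch, heads, seq_len, head_dim), already passed through the
    #       non-negative feature map; v: (batch, heads, seq_len, head_dim).
    # key_padding_mask: (batch, seq_len), True at padded positions (illustrative convention).

    # Left product: materialize the (seq_len x seq_len) score matrix explicitly.
    # This is what makes masking possible, at O(n^2) memory cost.
    scores = torch.einsum("bhnd,bhmd->bhnm", q, k)

    if key_padding_mask is not None:
        # Remove contributions from padded key positions before normalization.
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], 0.0)

    # Row-wise normalization (the linear-attention denominator).
    denom = scores.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return torch.einsum("bhnm,bhmd->bhnd", scores / denom, v)
```

In the right-product (linear-complexity) path the n × n matrix is never formed, so a padding mask has to be folded in earlier, e.g. by zeroing the padded positions of k and v before the kv contraction.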
Hello, I would like to ask whether there are any experiments on short convolutions. In my experiments, short convolutions had no effect at all.
> All the Taylor-exp attention experiments in the wandb report I shared use the [BaseConv](https://github.com/HazyResearch/zoology/blob/main/based_refs/gated_conv_ref.py) in place of attention in every second Transformer block. This aligns with the Zoology repo...
> @fattorib yea, the convs were the least interesting part of their architecture
>
> didn't expect much from it. try the gateloop + linear attention though (keeping the feedforwards)...
My biggest question about short conv is: if it doesn't work at all, why do Based, Mamba, and RWKV all adopt this operation?
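For context, here is a minimal sketch of what "short convolution" typically means in this line of work: a depthwise, causal 1D convolution with a small kernel applied along the sequence dimension. The kernel size and module name are illustrative, not taken from any of those codebases:

```python
import torch
import torch.nn as nn

class ShortConv(nn.Module):
    """Depthwise causal 1D convolution with a small kernel over the sequence."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(
            d_model, d_model,
            kernel_size=kernel_size,
            groups=d_model,           # depthwise: one filter per channel
            padding=kernel_size - 1,  # pad by k-1; with the trim below this acts as a causal (left) pad
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        y = self.conv(x.transpose(1, 2))  # (batch, d_model, seq_len + kernel_size - 1)
        y = y[..., : x.size(1)]           # trim the right side so each position sees only itself and the past
        return y.transpose(1, 2)          # back to (batch, seq_len, d_model)
```

In Based- and Mamba-style blocks this kind of layer usually sits immediately before the sequence mixer as a cheap local token-mixing step.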
> @Doraemonzzz my advice is, unless you doubt your own experimental techniques, always trust what you see with your own eyes over the claims of a paper

Thank you...
Hello, there are some minor bugs in testing the current model, and we are currently fixing them. For now, you can work around this issue by adding the following command. ```...
Thank you for your feedback, but I believe this issue would be more appropriately raised at https://github.com/OpenNLPLab/lightning-attention. Could you please open the same issue in the lightning-attention repository? I will...
Hi, great job. I am the author of [Lrpe](https://arxiv.org/abs/2307.09270). I would like to ask how the authors view the differences between LieRE and Lrpe. Let me briefly explain Lrpe here:...
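To make the comparison concrete, the core constraint in Lrpe can be written as follows (my paraphrase; the symbols $M_t$ and $W_{t-s}$ are introduced here for illustration and need not match the paper's notation):

$$
a_{st} \;=\; (M_s q_s)^{*}(M_t k_t) \;=\; q_s^{*}\, M_s^{*} M_t\, k_t,
\qquad \text{with}\qquad M_s^{*} M_t = W_{t-s}.
$$

That is, the positional transform acts token-wise on the queries and keys, so it composes with the right-product (linear-complexity) form of attention, while the resulting score depends only on the relative offset $t - s$; unitary solutions such as rotation (RoPE-style) matrices satisfy this constraint.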