Doraemonzzz
> We are examining non-NLP applications of the cosformer self-attention, and would need to use attention masking for the padded tokens in the batch. Is there a way to incorporate...
When using the forward() function, there is no direct way to apply an attention mask, since we never compute the attention matrix explicitly. If you need an attention mask, we suggest you use left_product,...
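For illustration, here is a minimal sketch of padding-mask support in the left-product (quadratic) form of linear attention. The function name, mask convention, and shapes are my own and not the cosFormer API; the point is simply that the n × n score matrix exists in this path, so it can be masked before normalization:

```python
import torch

def left_product_linear_attention(q, k, v, key_padding_mask=None):
    # q, k: (batch, heads, seq_len, head_dim), already passed through the
    #       non-negative feature map; v: (batch, heads, seq_len, head_dim).
    # key_padding_mask: (batch, seq_len), True at padded positions (illustrative convention).

    # Left product: materialize the (seq_len x seq_len) score matrix explicitly.
    # This is what makes masking possible, at O(n^2) memory cost.
    scores = torch.einsum("bhnd,bhmd->bhnm", q, k)

    if key_padding_mask is not None:
        # Remove contributions from padded key positions before normalization.
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], 0.0)

    # Row-wise normalization (the linear-attention denominator).
    denom = scores.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return torch.einsum("bhnm,bhmd->bhnd", scores / denom, v)
```

In the right-product (linear-complexity) path the n × n matrix is never formed, so a padding mask has to be folded in earlier, e.g. by zeroing the padded positions of k and v before the kv contraction.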
Hello, I would like to ask whether there are any experiments on short convolutions. In my experiments, short convolutions had no effect at all.
> All the Taylor-exp attention experiments in the wandb report I shared use the [BaseConv](https://github.com/HazyResearch/zoology/blob/main/based_refs/gated_conv_ref.py) in place of attention in every second Transformer block. This aligns with the Zoology repo...
> @fattorib yea, the convs were the least interesting part of their architecture
>
> didn't expect much from it. try the gateloop + linear attention though (keeping the feedforwards)...
My biggest question about short conv is: if it doesn't work at all, why do Based, Mamba, and RWKV all adopt this operation?
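For context, here is a minimal sketch of what "short convolution" typically means in this line of work: a depthwise, causal 1D convolution with a small kernel applied along the sequence dimension. The kernel size and module name are illustrative, not taken from any of those codebases:

```python
import torch
import torch.nn as nn

class ShortConv(nn.Module):
    """Depthwise causal 1D convolution with a small kernel over the sequence."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(
            d_model, d_model,
            kernel_size=kernel_size,
            groups=d_model,           # depthwise: one filter per channel
            padding=kernel_size - 1,  # pad by k-1; with the trim below this acts as a causal (left) pad
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        y = self.conv(x.transpose(1, 2))  # (batch, d_model, seq_len + kernel_size - 1)
        y = y[..., : x.size(1)]           # trim the right side so each position sees only itself and the past
        return y.transpose(1, 2)          # back to (batch, seq_len, d_model)
```

In Based- and Mamba-style blocks this kind of layer usually sits immediately before the sequence mixer as a cheap local token-mixing step.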
> @Doraemonzzz my advice is, unless you doubt your own experimental techniques, always trust what you see with your own eyes over the claims of a paper

Thank you...
Hello, there are some minor bugs in testing the current model, and we are currently fixing them. For now, you can work around this issue by adding the following command. ```...
Thank you for your feedback, but I believe this issue would be more appropriately raised at https://github.com/OpenNLPLab/lightning-attention. Could you please open the same issue in the lightning-attention repository? I will...
Hi, great job. I am the author of [Lrpe](https://arxiv.org/abs/2307.09270). I would like to ask how the authors view the differences between LieRE and Lrpe. Let me briefly explain Lrpe here:...
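To make the comparison concrete, the core constraint in Lrpe can be written as follows (my paraphrase; the symbols $M_t$ and $W_{t-s}$ are introduced here for illustration and need not match the paper's notation):

$$
a_{st} \;=\; (M_s q_s)^{*}(M_t k_t) \;=\; q_s^{*}\, M_s^{*} M_t\, k_t,
\qquad \text{with}\qquad M_s^{*} M_t = W_{t-s}.
$$

That is, the positional transform acts token-wise on the queries and keys, so it composes with the right-product (linear-complexity) form of attention, while the resulting score depends only on the relative offset $t - s$; unitary solutions such as rotation (RoPE-style) matrices satisfy this constraint.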