Dan Fu
Hi Elliott, thanks for the interest! We have an updated LoCoV1 described in the arXiv paper (https://arxiv.org/abs/2402.07440v2) - we'll have it on HF with updated checkpoints soon (we ran into...
@iNeil77 here you go, Jon's tweet and blog have links: https://x.com/JonSaadFalcon/status/1792623213698232808
Mamba does not have a convolutional form, so there isn't an exact mapping. For Mamba you'll have to use the scan formulation as documented in the [paper](https://arxiv.org/abs/2312.00752).
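For reference, here's a minimal sketch of the scan (recurrent) view of a diagonal SSM - names and shapes are illustrative only; Mamba additionally makes A, B, C input-dependent ("selective") and uses a hardware-aware parallel scan rather than a Python loop:

```python
import torch

def ssm_scan(A, B, C, x):
    """Recurrent (scan) view of a diagonal SSM: h_t = A * h_{t-1} + B * x_t, y_t = C . h_t.
    Shapes here are illustrative: A, B, C are (d_state,) vectors and x is a (seq_len,) signal
    for a single channel. In Mamba, A, B, C are additionally functions of the input at each step.
    """
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:
        h = A * h + B * x_t        # state update (elementwise, since the SSM is diagonal)
        ys.append((C * h).sum())   # readout
    return torch.stack(ys)
```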
This synthetic was designed to help figure out the gap on causal language modeling, and it was originally used for a state space model (H3), which is naturally causal by design. We...
My intuition is that next-token prediction actually makes it easier for the model to learn the circuit it needs to do associative recall, since there are more tokens it needs...
For the M2-BERT synthetics, we ran a non-causal form of associative recall to fine-tune the architecture. For the induction head task, the same analogy applies to what we implemented here, but...
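For a concrete picture of the task, here's a toy sketch of how an associative recall example can be generated (illustrative only, not the exact synthetic used in the papers):

```python
import torch

def make_assoc_recall_example(num_pairs=8, vocab_size=64, seed=0):
    """Toy associative-recall example (a sketch, not the exact synthetic from the papers).
    The input is a sequence of (key, value) pairs followed by a query key; the target is the
    value that was paired with that key. Keys and values come from disjoint halves of a
    hypothetical vocabulary so they can't be confused with each other.
    """
    g = torch.Generator().manual_seed(seed)
    keys = torch.randperm(vocab_size // 2, generator=g)[:num_pairs]               # unique keys
    values = torch.randint(vocab_size // 2, vocab_size, (num_pairs,), generator=g)
    pairs = torch.stack([keys, values], dim=1).flatten()                           # k1 v1 k2 v2 ...
    q = torch.randint(num_pairs, (1,), generator=g)                                # which key to query
    inputs = torch.cat([pairs, keys[q]])                                           # ... followed by the query key
    target = values[q]                                                             # the model should recall this value
    return inputs, target
```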
The module already supports multi-head - you can find an example in the H3 code: https://github.com/HazyResearch/safari/blob/main/src/models/sequence/h3.py#L160 In H3, the names of the three branches (what Hyena calls `x[0]`, `x[1]`, and...
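As a rough sketch of what the multi-head split looks like (illustrative names only - see the linked H3 code for how the operator itself handles heads):

```python
import torch
from einops import rearrange

def to_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_model // num_heads).
    Each of the three projected branches (named q, k, v in H3; x[0], x[1], x[2] in Hyena)
    can be split this way before the operator is applied, then merged back afterwards.
    """
    return rearrange(x, 'b l (h d) -> b h l d', h=num_heads)

def from_heads(x):
    """Inverse of to_heads: merge the heads back into the model dimension."""
    return rearrange(x, 'b h l d -> b l (h d)')
```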
Hm interesting! This is definitely slower than it should be :) Can you give some more details on your environment? (GPU, version of PyTorch, etc.) If you're using FlashFFTConv, a...
Thanks for this bug report! This is because the RTX series has less SRAM than A100/H100 (99 KB vs. 163/227 KB), which I didn't check for during development. You should...
These are to create a causal convolution. If you make the kernel length equal to 2L, you get a bidirectional convolution. If the kernel is length L, it’s called a...
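For concreteness, here's a sketch of the standard zero-padding trick for the causal case (illustrative, not the repo's exact code):

```python
import torch

def fft_conv_causal(x, k):
    """Causal depthwise convolution via FFT (a sketch of the standard trick).
    x: (batch, channels, seq_len), k: (channels, seq_len) where k[i] is the weight for lag i.
    Zero-padding both to 2L makes the FFT's circular convolution behave like a linear one,
    and keeping only the first L outputs means y[t] depends only on x[0..t] (causal).
    """
    L = x.shape[-1]
    fft_size = 2 * L
    x_f = torch.fft.rfft(x.float(), n=fft_size)
    k_f = torch.fft.rfft(k.float(), n=fft_size)
    y = torch.fft.irfft(x_f * k_f, n=fft_size)[..., :L]
    return y.type_as(x)
```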