Dan Fu
Hi Elliott, thanks for the interest! We have an updated LoCoV1 described in the arXiv paper (https://arxiv.org/abs/2402.07440v2) - we'll have it on HF with updated checkpoints soon (we ran into...
@iNeil77 here you go, Jon's tweet and blog have links: https://x.com/JonSaadFalcon/status/1792623213698232808
Mamba does not have a convolutional form, so there isn't an exact mapping. For Mamba you'll have to use the scan formulation as documented in the [paper](https://arxiv.org/abs/2312.00752).
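For reference, here's a minimal sketch of the scan (recurrent) view of a diagonal SSM - names and shapes are illustrative only; Mamba additionally makes A, B, C input-dependent ("selective") and uses a hardware-aware parallel scan rather than a Python loop:

```python
import torch

def ssm_scan(A, B, C, x):
    """Recurrent (scan) view of a diagonal SSM: h_t = A * h_{t-1} + B * x_t, y_t = C . h_t.
    Shapes here are illustrative: A, B, C are (d_state,) vectors and x is a (seq_len,) signal
    for a single channel. In Mamba, A, B, C are additionally functions of the input at each step.
    """
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:
        h = A * h + B * x_t        # state update (elementwise, since the SSM is diagonal)
        ys.append((C * h).sum())   # readout
    return torch.stack(ys)
```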
This synthetic was designed to help figure out the gap on causal language modeling, and it was originally used for a state space model (H3), which is naturally causal by design. We...
My intuition is that next-token prediction actually makes it easier for the model to learn the circuit it needs to do associative recall, since there are more tokens it needs...
For the M2-BERT synthetics, we ran a non-causal form of associative recall to fine-tune the architecture. For the induction head task, the same analogy applies to what we implemented here, but...
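For a concrete picture of the task, here's a toy sketch of how an associative recall example can be generated (illustrative only, not the exact synthetic used in the papers):

```python
import torch

def make_assoc_recall_example(num_pairs=8, vocab_size=64, seed=0):
    """Toy associative-recall example (a sketch, not the exact synthetic from the papers).
    The input is a sequence of (key, value) pairs followed by a query key; the target is the
    value that was paired with that key. Keys and values come from disjoint halves of a
    hypothetical vocabulary so they can't be confused with each other.
    """
    g = torch.Generator().manual_seed(seed)
    keys = torch.randperm(vocab_size // 2, generator=g)[:num_pairs]               # unique keys
    values = torch.randint(vocab_size // 2, vocab_size, (num_pairs,), generator=g)
    pairs = torch.stack([keys, values], dim=1).flatten()                           # k1 v1 k2 v2 ...
    q = torch.randint(num_pairs, (1,), generator=g)                                # which key to query
    inputs = torch.cat([pairs, keys[q]])                                           # ... followed by the query key
    target = values[q]                                                             # the model should recall this value
    return inputs, target
```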
The module already supports multi-head - you can find an example in the H3 code: https://github.com/HazyResearch/safari/blob/main/src/models/sequence/h3.py#L160 In H3, the names of the three branches (what Hyena calls `x[0]`, `x[1]`, and...
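As a rough sketch of what the multi-head split looks like (illustrative names only - see the linked H3 code for how the operator itself handles heads):

```python
import torch
from einops import rearrange

def to_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_model // num_heads).
    Each of the three projected branches (named q, k, v in H3; x[0], x[1], x[2] in Hyena)
    can be split this way before the operator is applied, then merged back afterwards.
    """
    return rearrange(x, 'b l (h d) -> b h l d', h=num_heads)

def from_heads(x):
    """Inverse of to_heads: merge the heads back into the model dimension."""
    return rearrange(x, 'b h l d -> b l (h d)')
```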
Hm interesting! This is definitely slower than it should be :) Can you give some more details on your environment? (GPU, version of PyTorch, etc.) If you're using FlashFFTConv, a...
Thanks for this bug report! This is because the RTX series has less SRAM than A100/H100 (99 KB vs. 163/227 KB), which I didn't check for during development. You should...
These are to create a causal convolution. If you make the kernel length equal to 2L, you get a bidirectional convolution. If the kernel is length L, it’s called a...
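For concreteness, here's a sketch of the standard zero-padding trick for the causal case (illustrative, not the repo's exact code):

```python
import torch

def fft_conv_causal(x, k):
    """Causal depthwise convolution via FFT (a sketch of the standard trick).
    x: (batch, channels, seq_len), k: (channels, seq_len) where k[i] is the weight for lag i.
    Zero-padding both to 2L makes the FFT's circular convolution behave like a linear one,
    and keeping only the first L outputs means y[t] depends only on x[0..t] (causal).
    """
    L = x.shape[-1]
    fft_size = 2 * L
    x_f = torch.fft.rfft(x.float(), n=fft_size)
    k_f = torch.fft.rfft(k.float(), n=fft_size)
    y = torch.fft.irfft(x_f * k_f, n=fft_size)[..., :L]
    return y.type_as(x)
```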