ben fattori

Results: 10 comments by ben fattori

I have a follow-up question about this. Currently the readme suggests that using the following pipeline will automatically split by node and worker:

```python
dataset = wds.DataPipeline(
    wds.ResampledShards("source-{000000..000999}.tar"),
    wds.non_empty,
    wds.tarfile_samples,
    ...
```
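
If it isn't automatic, my understanding is that the split has to be added explicitly, roughly like the sketch below. This assumes the standard `wds.split_by_node` / `wds.split_by_worker` stages, and I've swapped `ResampledShards` for `SimpleShardList` since explicit splitting usually goes with a non-resampled shard list; the shard pattern is the same placeholder as above.

```python
import webdataset as wds

# Sketch: splitting shards explicitly across nodes and dataloader workers,
# rather than relying on the pipeline above to do it implicitly.
dataset = wds.DataPipeline(
    wds.SimpleShardList("source-{000000..000999}.tar"),
    wds.split_by_node,     # each distributed rank sees a disjoint subset of shards
    wds.split_by_worker,   # each DataLoader worker sees a disjoint subset of those
    wds.tarfile_to_samples(),
    wds.shuffle(1000),
)
```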

@karpathy With respect to weight decay, I was under the impression that the token embeddings were regularized too, and only bias terms were excluded. Unfortunately, the only source I have...
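
For context, the grouping I had in mind looks roughly like this: a sketch that applies weight decay to all >=2D parameters (linear weights and the token embedding matrix) and excludes 1D parameters (biases, LayerNorm gains). The function name and hyperparameters here are illustrative, not the exact `configure_optimizers` implementation.

```python
import torch

def build_optimizer(model, weight_decay=0.1, lr=6e-4, betas=(0.9, 0.95)):
    # Decay group: matrices (linear weights, token embedding).
    # No-decay group: vectors (biases, LayerNorm gains).
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, betas=betas)
```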

The ```configure_optimizers``` error has been fixed. I'm seeing matching logit outputs between NanoGPT/HF with the following snippet:

```python
import torch
from model import GPT
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = ...
```
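
Since the snippet above is cut off, here is a rough sketch of the kind of comparison I mean: load the same GPT-2 weights on both sides and check the logits agree. It assumes nanoGPT's `GPT.from_pretrained` classmethod, and (if I remember right) nanoGPT only returns last-position logits when no targets are passed, hence the dummy targets; the tolerance is illustrative.

```python
import torch
from model import GPT  # nanoGPT
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
hf_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
nano_model = GPT.from_pretrained("gpt2").eval()

idx = tokenizer("Hello, my name is", return_tensors="pt").input_ids

with torch.no_grad():
    hf_logits = hf_model(idx).logits
    # Pass dummy targets so nanoGPT returns logits for every position,
    # not just the last one.
    nano_logits, _ = nano_model(idx, targets=idx)

print(torch.allclose(hf_logits, nano_logits, atol=1e-4))
```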

@lucidrains Thanks for the detailed response. I'm currently using a head dim of 16; I also tried bumping the value head dimension up to 64, but no luck there. Even interleaving the...

@lucidrains I also tried scaling the number of heads from 12 to 50, without any major performance improvement... I'll give the token-shifting a try too.

Sharing a WandB report comparing Taylor vs Linear vs Softmax Attention [here](https://api.wandb.ai/links/bfattori/85iwah1s). I also have a [second report](https://api.wandb.ai/links/bfattori/7qocwx43) where I tried increasing the number of heads to 48 on a...

Thank you! I'll check the repo out.

Both have 12 heads

All the Taylor-exp attention experiments in the wandb report I shared use the [BaseConv](https://github.com/HazyResearch/zoology/blob/main/based_refs/gated_conv_ref.py) in place of attention in every second Transformer block. This aligns with the Zoology repo in...
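
To make the layout concrete, the block stack looks roughly like the sketch below, where the attention and conv classes stand in for the Taylor-exp attention block and the BaseConv mixer from the linked reference; the function and constructor signatures are illustrative, not the exact Zoology code.

```python
import torch.nn as nn

def build_mixers(n_layers, d_model, attn_cls, conv_cls):
    # Alternate sequence mixers so every second block uses the
    # gated-conv (BaseConv-style) mixer instead of attention.
    mixers = []
    for layer_idx in range(n_layers):
        if layer_idx % 2 == 0:
            mixers.append(attn_cls(d_model))  # Taylor-exp attention block
        else:
            mixers.append(conv_cls(d_model))  # gated-conv block
    return nn.ModuleList(mixers)
```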

I know Griffin/RecurrentGemma are a bit old now, but if there is still interest in this, I'd be happy to port over the fully fused RG-LRU kernel I wrote [here](https://github.com/fattorib/hawk-pytorch/blob/main/hawk/scan_fused.py)...
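
In case it's useful for evaluating this, here's a minimal unfused sketch of the RG-LRU recurrence as I read it from the Griffin paper: h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t), with a_t = sigmoid(Λ)^(c * r_t). The gate projections are assumed to be computed upstream, and the fused kernel linked above performs this same scan in a single pass.

```python
import torch

def rg_lru_reference(x, r, i, Lambda, c=8.0):
    """Unfused RG-LRU scan (reference sketch).

    x:      (batch, seq, dim)  inputs to the recurrence
    r, i:   (batch, seq, dim)  recurrence / input gates, already sigmoided
    Lambda: (dim,)             learnable decay parameter
    """
    a = torch.sigmoid(Lambda)        # per-channel decay in (0, 1)
    a_t = a ** (c * r)               # gated decay
    gated_x = i * x
    h = torch.zeros_like(x[:, 0])
    ys = []
    for t in range(x.shape[1]):
        # h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)
        h = a_t[:, t] * h + torch.sqrt(1.0 - a_t[:, t] ** 2) * gated_x[:, t]
        ys.append(h)
    return torch.stack(ys, dim=1)
```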