Dan Fu

103 comments of Dan Fu

This may actually introduce some correctness issues IIUC (especially removing the stream sync). There does need to be some stream management in the dispatch. There's another set of changes...
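To make the ordering issue concrete, here's a minimal PyTorch sketch (my own illustration, not the code under discussion) of the kind of stream synchronization the dispatch has to preserve:

```python
import torch

# Work queued on a side stream must be ordered against the current stream;
# dropping either sync below lets later ops race with unfinished kernels.
x = torch.randn(1024, 1024, device="cuda")
side = torch.cuda.Stream()

# The side stream waits until the producer of x has finished.
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    y = x @ x  # kernel launched on the side stream

# The current stream waits for the side stream before consuming y.
torch.cuda.current_stream().wait_stream(side)
z = y + 1
```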

Yep these two lines are the things we need: https://github.com/flashinfer-ai/flashinfer/blob/0a754ce4fcae45fb0ce231de0bb03bc796bb44b3/csrc/norm.cu#L67-L68. The tradeoff is that it makes compilation more expensive, so ideally we gate it behind a compiler flag. I have a...
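For context, one common way to gate expensive instantiations is at build time, via an opt-in macro. A rough sketch (the flag and file names here are hypothetical, not flashinfer's):

```python
# setup.py sketch: only pass the macro to nvcc when the user opts in,
# so the default build avoids the extra compile cost.
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

nvcc_flags = ["-O3"]
if os.environ.get("ENABLE_EXTRA_DTYPES", "0") == "1":  # hypothetical flag
    nvcc_flags.append("-DENABLE_EXTRA_DTYPES")

setup(
    name="mylib",
    ext_modules=[
        CUDAExtension(
            name="mylib._C",
            sources=["csrc/my_kernels.cu"],  # hypothetical source file
            extra_compile_args={"cxx": ["-O3"], "nvcc": nvcc_flags},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```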

Take a look at this branch: https://github.com/HazyResearch/ThunderKittens/tree/danfu09/update-attn Any other optimizations you see there? It's pretty old code at this point :)

I'm not very familiar with ONNX - what you would need to do the long convolution efficiently is an FFT operation. Out of curiosity, can you describe the...
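For reference, the FFT-based long convolution itself is short in plain PyTorch; this is a minimal sketch of the operation (zero-padded to avoid circular wraparound), not an ONNX export recipe:

```python
import torch

def fft_long_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Long convolution via FFT.

    u: (batch, d, L) input; k: (d, L) kernel.
    Padding to 2L makes the circular FFT convolution match a linear one.
    """
    L = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
```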

See https://github.com/HazyResearch/ThunderKittens/tree/main/kernels/attn/demo/mla_decode and the accompanying blog: https://hazyresearch.stanford.edu/blog/2025-03-04-thundermla

I recommend using the NVIDIA PyTorch Docker containers: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch We ran our experiments on version 23.05 (PyTorch 2.0.0, CUDA 12.1.1); see the support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html However, the MLP layers in Monarch Mixer are not...

Yes, the MLP layers are vanilla PyTorch and do not use FlashFFTConv. If you want to reproduce all the experiments without Docker, you’ll have to install all the dependencies as...

I recommend keeping the expansion factor at a power of 2 to begin with and seeing how things work. You may have to change around other hyperparameters to account for model...
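As a rough sketch (my naming, not the repo's), the expansion factor just sets the hidden width of the vanilla PyTorch MLP mentioned above, so a power of 2 keeps the shapes friendly:

```python
import torch.nn as nn

def make_mlp(d_model: int, expand: int = 4) -> nn.Sequential:
    # hidden = d_model * expand; start with expand in {2, 4} and adjust
    # other hyperparameters to keep parameter counts comparable.
    hidden = d_model * expand
    return nn.Sequential(
        nn.Linear(d_model, hidden),
        nn.GELU(),
        nn.Linear(hidden, d_model),
    )
```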

Please print out `self.linear.weight.shape` - that will help debug the problem.
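For what it's worth, `nn.Linear` stores its weight as `(out_features, in_features)`, so the printed shape shows which dimension the input fails to match:

```python
import torch.nn as nn

linear = nn.Linear(in_features=256, out_features=512)
print(linear.weight.shape)  # torch.Size([512, 256]) -> (out_features, in_features)
```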