Dan Fu

103 comments of Dan Fu

This may actually introduce some correctness issues IIUC (especially removing the stream sync). There does need to be some stream management in the dispatch. There's another set of changes...
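To make the ordering issue concrete, here's a minimal PyTorch sketch (my own illustration, not the code under discussion) of the kind of stream synchronization the dispatch has to preserve:

```python
import torch

# Work queued on a side stream must be ordered against the current stream;
# dropping either sync below lets later ops race with unfinished kernels.
x = torch.randn(1024, 1024, device="cuda")
side = torch.cuda.Stream()

# The side stream waits until the producer of x has finished.
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    y = x @ x  # kernel launched on the side stream

# The current stream waits for the side stream before consuming y.
torch.cuda.current_stream().wait_stream(side)
z = y + 1
```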

Yep these two lines are the things we need: https://github.com/flashinfer-ai/flashinfer/blob/0a754ce4fcae45fb0ce231de0bb03bc796bb44b3/csrc/norm.cu#L67-L68. The tradeoff is that it makes compilation more expensive, so ideally we gate it behind a compiler flag. I have a...
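For context, one common way to gate expensive instantiations is at build time, via an opt-in macro. A rough sketch (the flag and file names here are hypothetical, not flashinfer's):

```python
# setup.py sketch: only pass the macro to nvcc when the user opts in,
# so the default build avoids the extra compile cost.
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

nvcc_flags = ["-O3"]
if os.environ.get("ENABLE_EXTRA_DTYPES", "0") == "1":  # hypothetical flag
    nvcc_flags.append("-DENABLE_EXTRA_DTYPES")

setup(
    name="mylib",
    ext_modules=[
        CUDAExtension(
            name="mylib._C",
            sources=["csrc/my_kernels.cu"],  # hypothetical source file
            extra_compile_args={"cxx": ["-O3"], "nvcc": nvcc_flags},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```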

Take a look at this branch: https://github.com/HazyResearch/ThunderKittens/tree/danfu09/update-attn Any other optimizations you see there? It's pretty old code at this point :)

I'm not very familiar with ONNX - what you would need to do the long convolution efficiently is an FFT operation. Out of curiosity, can you describe the...
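For reference, the FFT-based long convolution itself is short in plain PyTorch; this is a minimal sketch of the operation (zero-padded to avoid circular wraparound), not an ONNX export recipe:

```python
import torch

def fft_long_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Long convolution via FFT.

    u: (batch, d, L) input; k: (d, L) kernel.
    Padding to 2L makes the circular FFT convolution match a linear one.
    """
    L = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * L)
    k_f = torch.fft.rfft(k, n=2 * L)
    return torch.fft.irfft(u_f * k_f, n=2 * L)[..., :L]
```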

See https://github.com/HazyResearch/ThunderKittens/tree/main/kernels/attn/demo/mla_decode and the accompanying blog: https://hazyresearch.stanford.edu/blog/2025-03-04-thundermla

I recommend using the NVIDIA PyTorch Docker containers: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch We ran our experiments on version 23.05 (PyTorch 2.0.0, CUDA 12.1.1); see the support matrix: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html However, the MLP layers in Monarch Mixer are not...

Yes, the MLP layers are vanilla PyTorch and do not use FlashFFTConv. If you want to reproduce all the experiments without Docker, you’ll have to install all the dependencies as...

I recommend keeping the expansion factor at a power of 2 to begin with and seeing how things work. You may have to change around other hyperparameters to account for model...
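As a rough sketch (my naming, not the repo's), the expansion factor just sets the hidden width of the vanilla PyTorch MLP mentioned above, so a power of 2 keeps the shapes friendly:

```python
import torch.nn as nn

def make_mlp(d_model: int, expand: int = 4) -> nn.Sequential:
    # hidden = d_model * expand; start with expand in {2, 4} and adjust
    # other hyperparameters to keep parameter counts comparable.
    hidden = d_model * expand
    return nn.Sequential(
        nn.Linear(d_model, hidden),
        nn.GELU(),
        nn.Linear(hidden, d_model),
    )
```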

Please print out `self.linear.weight.shape` - that will help debug the problem.
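For what it's worth, `nn.Linear` stores its weight as `(out_features, in_features)`, so the printed shape shows which dimension the input fails to match:

```python
import torch.nn as nn

linear = nn.Linear(in_features=256, out_features=512)
print(linear.weight.shape)  # torch.Size([512, 256]) -> (out_features, in_features)
```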