
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
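The ~4x figure comes from memory-bandwidth arithmetic: at small batch sizes the matmul is bound by streaming the weights, so shrinking them from 16 bits to 4 bits cuts the traffic roughly fourfold. A back-of-the-envelope sketch (the layer shape below is an illustrative assumption, not a Marlin measurement):

```python
# Rough memory-traffic model for a weight-bound FP16 x INT4 matmul.
# The 4096 x 4096 shape is an illustrative assumption only.

def weight_bytes(k: int, n: int, bits: int) -> int:
    """Bytes needed to stream a K x N weight matrix at the given precision."""
    return k * n * bits // 8

K, N = 4096, 4096
fp16 = weight_bytes(K, N, 16)
int4 = weight_bytes(K, N, 4)

print(f"FP16 weights: {fp16 / 2**20:.1f} MiB")   # 32.0 MiB
print(f"INT4 weights: {int4 / 2**20:.1f} MiB")   # 8.0 MiB
print(f"Ideal weight-bound speedup: {fp16 / int4:.1f}x")  # 4.0x
```

Once the batch grows large enough that activations and compute dominate, the weight-traffic saving no longer sets the ceiling, which is why the near-ideal speedup holds only up to medium batch sizes.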

Results: 35 marlin issues

Could one big kernel be used rather than many small kernels? Might a single kernel be faster?

Hi, thanks for your work! I wonder how we can generate tokens using model.generate() or model(inputs)? The following code produces a bug when the model is printed. `import transformers...

constexpr int a_sh_rd_delta_o = 2 * ((threads / 32) / (thread_n_blocks / 4)); 1. Does the 32 here refer to a warp? 2. What does 4 here mean? 3. What...
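For what it's worth, the `threads / 32` term does count warps (32 threads per warp on NVIDIA GPUs). Plugging in a hypothetical tile configuration makes the expression concrete; the parameter values below are assumptions for illustration, not the kernel's actual launch configuration:

```python
# Evaluate a_sh_rd_delta_o for one hypothetical tile configuration.
# threads=256 and thread_n_blocks=16 are illustrative assumptions only.
WARP_SIZE = 32                        # threads per warp on NVIDIA GPUs

threads = 256
thread_n_blocks = 16                  # e.g. thread_n = 256 -> 256 / 16 = 16

warps = threads // WARP_SIZE          # the "32" counts threads per warp
a_sh_rd_delta_o = 2 * (warps // (thread_n_blocks // 4))

print(warps)             # 8
print(a_sh_rd_delta_o)   # 2 * (8 // 4) = 4
```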

```
int slice_col_par = (iters * blockIdx.x) / k_tiles;
int slice_col = slice_col_par;
int slice_iters; // number of threadblock tiles in the current slice
int slice_count = 0; // ...
```
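As a reading aid, the integer-division indexing in that snippet can be modeled in a few lines; `iters = 7` and `k_tiles = 4` below are made-up values, not the kernel's actual configuration:

```python
# Toy model of the slice indexing: threadblock b starts at column tile
# (iters * b) // k_tiles.  iters=7, k_tiles=4 are illustrative assumptions.
iters, k_tiles = 7, 4

starts = [(iters * b) // k_tiles for b in range(8)]  # one entry per block
print(starts)  # [0, 1, 3, 5, 7, 8, 10, 12]
```

Consecutive threadblocks thus start in staggered columns, and several blocks can land in the same column when `iters < k_tiles` per step.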

I have been running some benchmarks with Marlin, but the speed-up is far from what is reported; in fact, it's actually slower than fp16. GPU: A6000 Ada ``` matrix_shape: [11008,...

Thanks for your wonderful work! I am trying to understand matrix A's layout in shared memory. I think A's shape is `(16 * thread_m_blocks) * (16 * thread_k_blocks)` in shared...
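The shape arithmetic in that question is easy to check numerically; the block-count values below are illustrative assumptions, not the kernel's fixed configuration:

```python
# Shared-memory tile shape for A as stated in the issue:
# (16 * thread_m_blocks) x (16 * thread_k_blocks).
# thread_m_blocks=4 and thread_k_blocks=4 are illustrative assumptions.
TILE = 16                                 # 16x16 sub-tile edge length

thread_m_blocks, thread_k_blocks = 4, 4
rows = TILE * thread_m_blocks             # 64
cols = TILE * thread_k_blocks             # 64
print(rows, cols)                         # 64 64
```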

Dear creators of Marlin, what a huge performance boost these kernels can bring! I'm super excited about this, as the open-source community has been lacking kernels that scale. To...

@efrantar Awesome work -- always enjoy your research on and implementation of efficient model inference. I was hoping that you could shed some light on the logic of the [packing](https://github.com/IST-DASLab/marlin/blob/512f1b1ba39ff708bcc95419f11cfd1285cd31b3/marlin/__init__.py#L102-L140)...
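For readers landing on this question: the core idea behind any INT4 packing is fitting eight 4-bit values into one 32-bit word. The sketch below shows only that plain sequential packing; Marlin's actual routine additionally permutes values to match the tensor-core fragment layout, which is what the linked code implements and this example does not attempt to reproduce:

```python
# Generic sketch of packing eight 4-bit integers into one 32-bit word,
# low nibble first.  This is NOT Marlin's exact permutation.

def pack8(vals):
    """Pack eight values in [0, 15] into a single int, low nibble first."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)
    return word

def unpack8(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 7, 0, 15, 1, 2, 9, 4]
packed = pack8(vals)
assert unpack8(packed) == vals
print(hex(packed))  # 0x4921f073
```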

Hi! You've probably already considered this, but would you be able to add support for Hopper H100 GPUs? A100s don't have nearly as much memory bandwidth. Am happy to run...

This setup cannot pass the unit tests (UT). Could you please check it?