
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
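The ~4x figure comes from memory-bandwidth arithmetic: at small batch sizes the matmul is bound by streaming the weights, so shrinking them from 16 bits to 4 bits cuts the traffic roughly fourfold. A back-of-the-envelope sketch (the layer shape below is an illustrative assumption, not a Marlin measurement):

```python
# Rough memory-traffic model for a weight-bound FP16 x INT4 matmul.
# The 4096 x 4096 shape is an illustrative assumption only.

def weight_bytes(k: int, n: int, bits: int) -> int:
    """Bytes needed to stream a K x N weight matrix at the given precision."""
    return k * n * bits // 8

K, N = 4096, 4096
fp16 = weight_bytes(K, N, 16)
int4 = weight_bytes(K, N, 4)

print(f"FP16 weights: {fp16 / 2**20:.1f} MiB")   # 32.0 MiB
print(f"INT4 weights: {int4 / 2**20:.1f} MiB")   # 8.0 MiB
print(f"Ideal weight-bound speedup: {fp16 / int4:.1f}x")  # 4.0x
```

Once the batch grows large enough that activations and compute dominate, the weight-traffic saving no longer sets the ceiling, which is why the near-ideal speedup holds only up to medium batch sizes.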

Results: 35 marlin issues

Could one big kernel be used rather than many small kernels? Might a single kernel be faster?

Hi, thanks for your work! I wonder how we can generate tokens using model.generate() or model(inputs)? The following code produces a bug when the model is printed. `import transformers...

constexpr int a_sh_rd_delta_o = 2 * ((threads / 32) / (thread_n_blocks / 4)); 1. Does the 32 here refer to a warp? 2. What does 4 here mean? 3. What...
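For what it's worth, the `threads / 32` term does count warps (32 threads per warp on NVIDIA GPUs). Plugging in a hypothetical tile configuration makes the expression concrete; the parameter values below are assumptions for illustration, not the kernel's actual launch configuration:

```python
# Evaluate a_sh_rd_delta_o for one hypothetical tile configuration.
# threads=256 and thread_n_blocks=16 are illustrative assumptions only.
WARP_SIZE = 32                        # threads per warp on NVIDIA GPUs

threads = 256
thread_n_blocks = 16                  # e.g. thread_n = 256 -> 256 / 16 = 16

warps = threads // WARP_SIZE          # the "32" counts threads per warp
a_sh_rd_delta_o = 2 * (warps // (thread_n_blocks // 4))

print(warps)             # 8
print(a_sh_rd_delta_o)   # 2 * (8 // 4) = 4
```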

```
int slice_col_par = (iters * blockIdx.x) / k_tiles;
int slice_col = slice_col_par;
int slice_iters; // number of threadblock tiles in the current slice
int slice_count = 0; // ...
```
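As a reading aid, the integer-division indexing in that snippet can be modeled in a few lines; `iters = 7` and `k_tiles = 4` below are made-up values, not the kernel's actual configuration:

```python
# Toy model of the slice indexing: threadblock b starts at column tile
# (iters * b) // k_tiles.  iters=7, k_tiles=4 are illustrative assumptions.
iters, k_tiles = 7, 4

starts = [(iters * b) // k_tiles for b in range(8)]  # one entry per block
print(starts)  # [0, 1, 3, 5, 7, 8, 10, 12]
```

Consecutive threadblocks thus start in staggered columns, and several blocks can land in the same column when `iters < k_tiles` per step.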

I have been running some benchmarks with Marlin, but the speed-up is far from what is reported; in fact, it's actually slower than fp16. GPU: A6000 Ada ``` matrix_shape: [11008,...

Thanks for your wonderful work! I am trying to understand matrix A's layout in shared memory. I think A's shape is `(16 * thread_m_blocks) * (16 * thread_k_blocks)` in shared...
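The shape arithmetic in that question is easy to check numerically; the block-count values below are illustrative assumptions, not the kernel's fixed configuration:

```python
# Shared-memory tile shape for A as stated in the issue:
# (16 * thread_m_blocks) x (16 * thread_k_blocks).
# thread_m_blocks=4 and thread_k_blocks=4 are illustrative assumptions.
TILE = 16                                 # 16x16 sub-tile edge length

thread_m_blocks, thread_k_blocks = 4, 4
rows = TILE * thread_m_blocks             # 64
cols = TILE * thread_k_blocks             # 64
print(rows, cols)                         # 64 64
```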

Dear creators of Marlin, what a huge performance boost these kernels can bring! I'm super excited about this, as the open-source community has been lacking kernels that scale. To...

@efrantar Awesome work -- always enjoy your research on and implementation of efficient model inference. I was hoping that you could shed some light on the logic of the [packing](https://github.com/IST-DASLab/marlin/blob/512f1b1ba39ff708bcc95419f11cfd1285cd31b3/marlin/__init__.py#L102-L140)...
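For readers landing on this question: the core idea behind any INT4 packing is fitting eight 4-bit values into one 32-bit word. The sketch below shows only that plain sequential packing; Marlin's actual routine additionally permutes values to match the tensor-core fragment layout, which is what the linked code implements and this example does not attempt to reproduce:

```python
# Generic sketch of packing eight 4-bit integers into one 32-bit word,
# low nibble first.  This is NOT Marlin's exact permutation.

def pack8(vals):
    """Pack eight values in [0, 15] into a single int, low nibble first."""
    assert len(vals) == 8 and all(0 <= v < 16 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)
    return word

def unpack8(word):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 7, 0, 15, 1, 2, 9, 4]
packed = pack8(vals)
assert unpack8(packed) == vals
print(hex(packed))  # 0x4921f073
```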

Hi! You've probably already considered this, but would you be able to add support for Hopper H100 GPUs? A100s don't have nearly as much memory bandwidth. Am happy to run...

This setup cannot pass the unit tests (UT). Could you please check it?