Elias Frantar
Hi, I just cloned the repository and for me everything seems to be working fine. Can you perhaps provide some more details of what you have been doing and what...
Hi, as for what settings you should use: if you don't use a move limit with `-l`, then the solver will simply search for the full `-m` milliseconds and return...
If the batchsize is larger than 64, we essentially process multiple batchsize 64 matmuls in a single kernel invocation (to allow better partitioning). This is done by virtually replicating the...
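To make the partitioning idea concrete, here is a minimal sketch (not Marlin's actual launch logic; the naive per-slice matmul is just a stand-in) of how a batch of M rows can be decomposed into ceil(M / 64) virtual batch-64 problems covered by a single launch:

```cuda
#include <cuda_fp16.h>

// Sketch: each blockIdx.y handles one virtual batch-64 slice of the full
// M-row problem, so one launch covers all slices and the work can be
// partitioned across SMs as if there were several batch-64 matmuls.
__global__ void replicated_matmul(const half* A, const half* B, half* C,
                                  int M, int N, int K) {
  int slice = blockIdx.y;        // which virtual batch-64 slice this is
  int m0 = slice * 64;           // first activation row of this slice
  int rows = min(64, M - m0);    // the last slice may be partial
  // Naive stand-in for the per-slice matmul: each thread strides over the
  // outputs of its slice's independent (rows x K) * (K x N) problem.
  for (int i = threadIdx.x; i < rows * N; i += blockDim.x) {
    int m = m0 + i / N, n = i % N;
    float acc = 0.f;
    for (int k = 0; k < K; ++k)
      acc += __half2float(A[m * K + k]) * __half2float(B[k * N + n]);
    C[m * N + n] = __float2half(acc);
  }
}

// Host side: one launch, with the y-dimension enumerating the slices.
// dim3 grid(1, (M + 63) / 64);
// replicated_matmul<<<grid, 256>>>(A, B, C, M, N, K);
```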
Hi, Marlin is primarily optimized for generative inference (with a few tokens at a time), which is actually memory-bound and can hence be sped up via weight quantization; e.g. input shapes of (16,...
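A quick back-of-envelope calculation shows why such small-batch shapes are memory-bound and why shrinking the weights helps almost 1:1; the layer sizes below are illustrative assumptions, not measurements:

```cuda
#include <cstdio>

// Arithmetic-intensity estimate for a (m x k) * (k x n) matmul with a
// small generative-inference batch; layer sizes are illustrative.
int main() {
  double m = 16, k = 4096, n = 4096;
  double flops = 2.0 * m * k * n;                           // multiply-adds
  double bytes_fp16 = 2 * m * k + 2 * k * n + 2 * m * n;    // all FP16
  double bytes_int4 = 2 * m * k + 0.5 * k * n + 2 * m * n;  // 4-bit weights
  // With m = 16, the k*n weight matrix dominates traffic, leaving the
  // kernel far below the flop/byte ratio a modern GPU needs to be
  // compute-bound; quartering the weight bytes raises throughput directly.
  printf("fp16 weights: %.0f flop/byte\n", flops / bytes_fp16);
  printf("int4 weights: %.0f flop/byte\n", flops / bytes_int4);
  return 0;
}
```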
Hi, Marlin only uses `ldmatrix` for the activations, as the weights are already preshuffled optimally for both dequantization and tensor core fragment layouts. You can find a more detailed description...
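As a hedged illustration of that asymmetry (a sketch, not the kernel's actual fragment-loading code): activations go through `ldmatrix`, which shuffles data from shared memory into fragment order in hardware, while the preshuffled weights only need a plain vectorized load (assuming 16-byte alignment):

```cuda
#include <cstdint>

// Activations: ldmatrix loads from shared memory and produces tensor-core
// fragments in registers, with the hardware doing the layout shuffle.
__device__ void load_act_frag(uint32_t frag[4], const void* smem_ptr) {
  uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
      : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
      : "r"(addr));
}

// Weights: already preshuffled offline into fragment order, so a single
// 128-bit vectorized load yields register contents directly usable by mma.
__device__ void load_wgt_frag(uint32_t frag[4], const uint32_t* ptr) {
  *reinterpret_cast<uint4*>(frag) = *reinterpret_cast<const uint4*>(ptr);
}
```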
Hi, in general, my experience is that when GPTQ is tuned and configured properly (e.g., with grid-clipping also enabled), results are extremely similar to AWQ. That being said, Marlin is a...
Hi, unfortunately, I don't have access to any H800s (or any Hopper GPUs for that matter), so it is a bit hard to test. Which of the matrix shapes are...
Hi, the L2 cache is used implicitly whenever global memory is fetched; the immediate eviction cache policy for weight loads is defined [here](https://github.com/IST-DASLab/marlin/blob/b930c72208a0ddf01f3604bd32b685471fe4c70d/marlin/marlin_cuda_kernel.cu#L68). The key is that we want to...
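For reference, the pattern at that line looks roughly as follows (a sketch of the PTX technique of attaching an `evict_first` L2 policy to an async copy; see the linked line for the exact code):

```cuda
#include <cstdint>

// Asynchronous 16-byte global->shared copy with an L2 evict-first cache
// hint: streamed weights are never reused from L2, so marking them
// evict-first keeps the cache free for data that actually benefits.
__device__ void cp_async_stream(void* smem_dst, const void* glob_src) {
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  asm volatile(
      "{\n"
      "  .reg .b64 p;\n"
      "  createpolicy.fractional.L2::evict_first.b64 p, 1.0;\n"
      "  cp.async.cg.shared.global.L2::cache_hint [%0], [%1], 16, p;\n"
      "}\n" ::"r"(smem), "l"(glob_src));
}
```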
Hi, currently Marlin supports only a limited set of quantization options (4bit + groupsize 128), selected for a good accuracy/speed trade-off but, in exchange, running at very close to peak efficiency in...
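To make the trade-off concrete, here is a back-of-envelope of what 4-bit weights with groupsize 128 cost in storage (the layer sizes are illustrative assumptions):

```cuda
#include <cstdio>

// Storage implied by 4-bit quantization with groupsize 128: two weights
// per byte, plus one FP16 scale per 128 weights along the k dimension.
int main() {
  long long k = 4096, n = 4096, groupsize = 128;
  long long weight_bytes = k * n / 2;               // 4 bits per weight
  long long scale_bytes = (k / groupsize) * n * 2;  // FP16 scale per group
  printf("weights: %lld bytes, scales: %lld bytes (%.2f%% overhead)\n",
         weight_bytes, scale_bytes, 100.0 * scale_bytes / weight_bytes);
  // -> ~3% metadata overhead over pure 4-bit storage, i.e. close to a
  //    4x reduction versus keeping the weights in FP16.
  return 0;
}
```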
Hi, Marlin does not use any INT4 tensor cores; 4-bit weights are decompressed on-the-fly and the actual computation is then carried out in FP16. The reason Turing is not supported...
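A simplified sketch of that decompression step (a shift-and-mask version for clarity; the actual kernel uses faster bit-manipulation tricks to the same effect):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// On-the-fly INT4 -> FP16 dequantization: unpack eight 4-bit weights from
// one 32-bit word, shift them to a symmetric range and apply the group
// scale, so the subsequent mma runs entirely on FP16 tensor cores.
__device__ void dequant8(uint32_t packed, half out[8], half scale) {
  #pragma unroll
  for (int i = 0; i < 8; ++i) {
    int q = (packed >> (4 * i)) & 0xF;             // extract i-th 4-bit value
    out[i] = __hmul(__int2half_rn(q - 8), scale);  // center at zero, rescale
  }
}
```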