Elias Frantar
Hi, I just cloned the repository and for me everything seems to be working fine. Can you perhaps provide some more details of what you have been doing and what...
Hi, as for what settings you should use: if you don't use a move limit with `-l`, then the solver will simply search for the full `-m` milliseconds and return...
If the batchsize is larger than 64, we essentially process multiple batchsize 64 matmuls in a single kernel invocation (to allow better partitioning). This is done by virtually replicating the...
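To make the partitioning idea concrete, here is a minimal sketch (not Marlin's actual launch logic; the naive per-slice matmul is just a stand-in) of how a batch of M rows can be decomposed into ceil(M / 64) virtual batch-64 problems covered by a single launch:

```cuda
#include <cuda_fp16.h>

// Sketch: each blockIdx.y handles one virtual batch-64 slice of the full
// M-row problem, so one launch covers all slices and the work can be
// partitioned across SMs as if there were several batch-64 matmuls.
__global__ void replicated_matmul(const half* A, const half* B, half* C,
                                  int M, int N, int K) {
  int slice = blockIdx.y;        // which virtual batch-64 slice this is
  int m0 = slice * 64;           // first activation row of this slice
  int rows = min(64, M - m0);    // the last slice may be partial
  // Naive stand-in for the per-slice matmul: each thread strides over the
  // outputs of its slice's independent (rows x K) * (K x N) problem.
  for (int i = threadIdx.x; i < rows * N; i += blockDim.x) {
    int m = m0 + i / N, n = i % N;
    float acc = 0.f;
    for (int k = 0; k < K; ++k)
      acc += __half2float(A[m * K + k]) * __half2float(B[k * N + n]);
    C[m * N + n] = __float2half(acc);
  }
}

// Host side: one launch, with the y-dimension enumerating the slices.
// dim3 grid(1, (M + 63) / 64);
// replicated_matmul<<<grid, 256>>>(A, B, C, M, N, K);
```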
Hi, Marlin is primarily optimized for generative inference (with a few tokens at a time), which is actually memory-bound and can hence be sped up via weight quantization; e.g. input shapes of (16,...
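A quick back-of-envelope calculation shows why such small-batch shapes are memory-bound and why shrinking the weights helps almost 1:1; the layer sizes below are illustrative assumptions, not measurements:

```cuda
#include <cstdio>

// Arithmetic-intensity estimate for a (m x k) * (k x n) matmul with a
// small generative-inference batch; layer sizes are illustrative.
int main() {
  double m = 16, k = 4096, n = 4096;
  double flops = 2.0 * m * k * n;                           // multiply-adds
  double bytes_fp16 = 2 * m * k + 2 * k * n + 2 * m * n;    // all FP16
  double bytes_int4 = 2 * m * k + 0.5 * k * n + 2 * m * n;  // 4-bit weights
  // With m = 16, the k*n weight matrix dominates traffic, leaving the
  // kernel far below the flop/byte ratio a modern GPU needs to be
  // compute-bound; quartering the weight bytes raises throughput directly.
  printf("fp16 weights: %.0f flop/byte\n", flops / bytes_fp16);
  printf("int4 weights: %.0f flop/byte\n", flops / bytes_int4);
  return 0;
}
```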
Hi, Marlin only uses `ldmatrix` for the activations, as the weights are already preshuffled optimally for both dequantization and tensor core fragment layouts. You can find a more detailed description...
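As a hedged illustration of that asymmetry (a sketch, not the kernel's actual fragment-loading code): activations go through `ldmatrix`, which shuffles data from shared memory into fragment order in hardware, while the preshuffled weights only need a plain vectorized load (assuming 16-byte alignment):

```cuda
#include <cstdint>

// Activations: ldmatrix loads from shared memory and produces tensor-core
// fragments in registers, with the hardware doing the layout shuffle.
__device__ void load_act_frag(uint32_t frag[4], const void* smem_ptr) {
  uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
      : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
      : "r"(addr));
}

// Weights: already preshuffled offline into fragment order, so a single
// 128-bit vectorized load yields register contents directly usable by mma.
__device__ void load_wgt_frag(uint32_t frag[4], const uint32_t* ptr) {
  *reinterpret_cast<uint4*>(frag) = *reinterpret_cast<const uint4*>(ptr);
}
```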
Hi, in general, my experience is that when GPTQ is tuned and configured properly (e.g., with grid-clipping also enabled), results are extremely similar to AWQ. That being said, Marlin is a...
Hi, unfortunately, I don't have access to any H800s (or any Hopper GPUs for that matter), so it is a bit hard to test. Which of the matrix shapes are...
Hi, the L2 cache is used implicitly whenever global memory is fetched; the immediate eviction cache policy for weight loads is defined [here](https://github.com/IST-DASLab/marlin/blob/b930c72208a0ddf01f3604bd32b685471fe4c70d/marlin/marlin_cuda_kernel.cu#L68). The key is that we want to...
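For reference, the pattern at that line looks roughly as follows (a sketch of the PTX technique of attaching an `evict_first` L2 policy to an async copy; see the linked line for the exact code):

```cuda
#include <cstdint>

// Asynchronous 16-byte global->shared copy with an L2 evict-first cache
// hint: streamed weights are never reused from L2, so marking them
// evict-first keeps the cache free for data that actually benefits.
__device__ void cp_async_stream(void* smem_dst, const void* glob_src) {
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  asm volatile(
      "{\n"
      "  .reg .b64 p;\n"
      "  createpolicy.fractional.L2::evict_first.b64 p, 1.0;\n"
      "  cp.async.cg.shared.global.L2::cache_hint [%0], [%1], 16, p;\n"
      "}\n" ::"r"(smem), "l"(glob_src));
}
```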
Hi, currently Marlin supports only a limited set of quantization options (4bit + groupsize 128), selected for a good accuracy/speed trade-off but, in exchange, running at very close to peak efficiency in...
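To make the trade-off concrete, here is a back-of-envelope of what 4-bit weights with groupsize 128 cost in storage (the layer sizes are illustrative assumptions):

```cuda
#include <cstdio>

// Storage implied by 4-bit quantization with groupsize 128: two weights
// per byte, plus one FP16 scale per 128 weights along the k dimension.
int main() {
  long long k = 4096, n = 4096, groupsize = 128;
  long long weight_bytes = k * n / 2;               // 4 bits per weight
  long long scale_bytes = (k / groupsize) * n * 2;  // FP16 scale per group
  printf("weights: %lld bytes, scales: %lld bytes (%.2f%% overhead)\n",
         weight_bytes, scale_bytes, 100.0 * scale_bytes / weight_bytes);
  // -> ~3% metadata overhead over pure 4-bit storage, i.e. close to a
  //    4x reduction versus keeping the weights in FP16.
  return 0;
}
```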
Hi, Marlin does not use any INT4 tensor cores; 4-bit weights are decompressed on-the-fly and the actual computation is then carried out in FP16. The reason Turing is not supported...
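A simplified sketch of that decompression step (a shift-and-mask version for clarity; the actual kernel uses faster bit-manipulation tricks to the same effect):

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// On-the-fly INT4 -> FP16 dequantization: unpack eight 4-bit weights from
// one 32-bit word, shift them to a symmetric range and apply the group
// scale, so the subsequent mma runs entirely on FP16 tensor cores.
__device__ void dequant8(uint32_t packed, half out[8], half scale) {
  #pragma unroll
  for (int i = 0; i < 8; ++i) {
    int q = (packed >> (4 * i)) & 0xF;             // extract i-th 4-bit value
    out[i] = __hmul(__int2half_rn(q - 8), scale);  // center at zero, rescale
  }
}
```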