James Fleming

Results: 3 issues by James Fleming

Fixed four methods that needed `override` specifiers, and added parentheses around a large mixed `&&`/`||` expression to silence the warning.

I found these clones while porting the CUDA kernels to vLLM. I couldn't tell what they were for (avoiding memory fragmentation?), but I got a 2% speed improvement on your llama2...

SUMMARY: Supports AQLM compressed inference; see https://github.com/Vahe1994/AQLM and https://arxiv.org/pdf/2401.06118.pdf. The optimized supported formats are 1x16 and 2x8. Tensor parallelism is supported. Only CUDA kernels are provided. Formats other than 1x16 and 2x8...

quantization