James Fleming

Results: 3 issues by James Fleming

Fixed four methods that needed `override` specifiers, and added parentheses around a large mixed `&&`/`||` expression to silence the warning.

I found these clones while porting the CUDA kernels to vLLM. I couldn't tell what they were for (avoiding memory fragmentation?), but I got a 2% speed improvement on your llama2...

SUMMARY: Supports AQLM compressed inference; see https://github.com/Vahe1994/AQLM and https://arxiv.org/pdf/2401.06118.pdf. The optimized supported formats are 1x16 and 2x8. Tensor parallelism is supported. Only CUDA kernels are provided. Formats other than 1x16 and 2x8...

quantization