Implementation of Q6_KFloatTensor

Open srogmann opened this issue 1 year ago • 2 comments

This PR contains a Q6_K implementation.
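For readers unfamiliar with the format, here is a minimal scalar sketch of how one Q6_K super-block of 256 values can be dequantized. It assumes the ggml layout (128 bytes of low 4-bit quants `ql`, 64 bytes of high 2-bit quants `qh`, 16 signed 8-bit scales, and one fp16 super-block scale `d`). The class name `Q6KSketch` and the method `dequantizeBlock` are hypothetical and are not taken from this PR. The sketch expects `d` already converted to `float`, and it is only a reference for the bit layout, not the vectorized code benchmarked below.

```java
final class Q6KSketch {
    static final int QK_K = 256; // values per super-block (assumed ggml constant)

    /**
     * Dequantizes one Q6_K super-block.
     * ql:     128 bytes, low 4 bits of each 6-bit quant
     * qh:      64 bytes, high 2 bits of each 6-bit quant
     * scales:  16 signed 8-bit sub-block scales
     * d:       super-block scale, already converted from fp16
     */
    static float[] dequantizeBlock(byte[] ql, byte[] qh, byte[] scales, float d) {
        float[] y = new float[QK_K];
        // The block is processed as two halves of 128 values each.
        for (int half = 0; half < 2; half++) {
            int yOff = half * 128;
            int qlOff = half * 64;
            int qhOff = half * 32;
            int scOff = half * 8;
            for (int l = 0; l < 32; l++) {
                int is = scOff + l / 16;
                int lo0 = ql[qlOff + l] & 0xFF;
                int lo1 = ql[qlOff + l + 32] & 0xFF;
                int hi = qh[qhOff + l] & 0xFF;
                // Combine 4 low bits with 2 high bits, then recentre from [0, 63] to [-32, 31].
                int q1 = ((lo0 & 0xF) | ((hi & 3) << 4)) - 32;
                int q2 = ((lo1 & 0xF) | (((hi >> 2) & 3) << 4)) - 32;
                int q3 = ((lo0 >> 4) | (((hi >> 4) & 3) << 4)) - 32;
                int q4 = ((lo1 >> 4) | (((hi >> 6) & 3) << 4)) - 32;
                y[yOff + l] = d * scales[is] * q1;
                y[yOff + l + 32] = d * scales[is + 2] * q2;
                y[yOff + l + 64] = d * scales[is + 4] * q3;
                y[yOff + l + 96] = d * scales[is + 6] * q4;
            }
        }
        return y;
    }
}
```

A dot product against a float vector can be built on top of this by multiplying and summing the dequantized values. The Vector API variants measured below presumably fuse these steps rather than materializing the `float[]`.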

Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q6_K
CPU: AMD Ryzen 9 7900X
JVM: OpenJDK 64-Bit Server VM
Linux: 6.9.7-arch1-1

Quant	Species	Speed
Q6_K	S_128_BIT	0.22 tokens/s
Q6_K	S_256_BIT (non-array)	0.47 tokens/s, 0.10 tokens/s
Q6_K	S_256_BIT (array)	1.26 tokens/s
Q6_K	S_256_BIT (512 bits)	0.29 tokens/s

Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q8_0

Quant	Species	Speed
Q8_0	S_128_BIT	4.02 tokens/s
Q8_0	S_256_BIT	5.80 tokens/s

Aug 10 '24 22:08 srogmann

I experimented with running this on a patched Graal compiler with partial Vector API support. I focused on vectorDot256 because that's the one most likely to be compiled properly... I got quite far: everything compiles properly up to the last large block with the sums, where I get an exception in the compiler... The bug seems to be in the compiler's internal tracking of the vectors... not in missing features. I believe that, with minor fixes, Graal will be able to compile this properly. I'll keep you posted.

Aug 12 '24 07:08 mukel

Did you try vectorDot256Array?

Aug 12 '24 20:08 srogmann