llama3.java

Implementation of Q6_KFloatTensor

Open srogmann opened this issue 1 year ago • 2 comments

This PR contains a Q6_K implementation.
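For readers unfamiliar with the format, here is a minimal scalar sketch of Q6_K dequantization. It is not the code from this PR. It assumes the Q6_K super-block layout used by ggml (256 weights per block: 128 bytes of low 4-bit values, 64 bytes of high 2-bit values, 16 signed 8-bit scales, and one fp16 block scale `d`, 210 bytes in total). The class and method names are hypothetical, and `d` is passed in as an already-converted `float` to keep the example independent of any fp16 helper.

```java
// Hypothetical illustration, not the PR's Q6_KFloatTensor.
final class Q6KScalarSketch {
    static final int QK_K = 256;                 // weights per super-block
    static final int QL_BYTES = QK_K / 2;        // 128 bytes: low 4 bits of each weight
    static final int QH_BYTES = QK_K / 4;        // 64 bytes: high 2 bits of each weight
    static final int SCALE_BYTES = QK_K / 16;    // 16 bytes: one int8 scale per 16 weights
    static final int BLOCK_BYTES = QL_BYTES + QH_BYTES + SCALE_BYTES + 2; // + fp16 d = 210

    // Dequantizes one super-block. d is the block scale, already converted from fp16.
    static float[] dequantizeBlock(byte[] ql, byte[] qh, byte[] scales, float d) {
        float[] y = new float[QK_K];
        // The block is processed as two halves of 128 weights each.
        for (int half = 0; half < 2; half++) {
            int yo = half * 128, qlo = half * 64, qho = half * 32, sco = half * 8;
            for (int l = 0; l < 32; l++) {
                int is = l / 16;                      // selects the scale within each group of 32
                int lo0 = ql[qlo + l] & 0xFF;         // two 4-bit values
                int lo1 = ql[qlo + l + 32] & 0xFF;    // two more 4-bit values
                int hi = qh[qho + l] & 0xFF;          // four 2-bit values
                // Combine 4 low bits and 2 high bits into 0..63, then recenter to -32..31.
                int q1 = ((lo0 & 0xF) | ((hi & 3) << 4)) - 32;
                int q2 = ((lo1 & 0xF) | (((hi >> 2) & 3) << 4)) - 32;
                int q3 = ((lo0 >> 4) | (((hi >> 4) & 3) << 4)) - 32;
                int q4 = ((lo1 >> 4) | (((hi >> 6) & 3) << 4)) - 32;
                y[yo + l]      = d * scales[sco + is]     * q1;
                y[yo + l + 32] = d * scales[sco + is + 2] * q2;
                y[yo + l + 64] = d * scales[sco + is + 4] * q3;
                y[yo + l + 96] = d * scales[sco + is + 6] * q4;
            }
        }
        return y;
    }
}
```

A vectorized dot product would presumably do the same bit unpacking on whole vector lanes instead of one weight at a time, which is where the choice of species in the benchmarks below comes in.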

  • Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q6_K
  • CPU: AMD Ryzen 9 7900X
  • JVM: OpenJDK 64-Bit Server VM
  • Linux: 6.9.7-arch1-1

| Quant | Species | Speed |
|-------|---------|-------|
| Q6_K | S_128_BIT | 0.22 tokens/s |
| Q6_K | S_256_BIT (non-array) | 0.47 tokens/s, 0.10 tokens/s |
| Q6_K | S_256_BIT (array) | 1.26 tokens/s |
| Q6_K | S_256_BIT (512 bits) | 0.29 tokens/s |

  • Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q8_0

| Quant | Species | Speed |
|-------|---------|-------|
| Q8_0 | S_128_BIT | 4.02 tokens/s |
| Q8_0 | S_256_BIT | 5.80 tokens/s |

srogmann, Aug 10 '24 22:08

I experimented with running this on a patched Graal compiler with partial Vector API support. I focused on vectorDot256 because it is the most likely to be compiled properly... I got quite far: everything compiles properly until the last large block with the sums, where I get an exception in the compiler... The bug seems to be in the compiler's internal tracking of the vectors... not a result of missing features. I believe that, with minor fixes, Graal will be able to compile this properly. I'll keep you posted.

mukel, Aug 12 '24 07:08

Did you try vectorDot256Array?

srogmann, Aug 12 '24 20:08