llama3.java
Implementation of Q6_KFloatTensor
This PR contains a Q6_K implementation.
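For reference, Q6_K in the ggml layout stores 256 weights per 210-byte super-block: 128 bytes of low nibbles, 64 bytes of high bit-pairs, 16 signed 8-bit sub-block scales, and one fp16 super-block scale, so each weight is reconstructed as `d * scale * (q6 - 32)`. Below is a minimal scalar sketch of per-element dequantization under that assumption; the class, method name, and `MemorySegment` access pattern are illustrative placeholders, not the code in this PR.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Minimal sketch: scalar Q6_K dequantization of one weight, assuming the ggml
// super-block layout (ql[128] | qh[64] | scales[16] | d fp16 = 210 bytes per 256 weights).
final class Q6KSketch {
    static float getFloat(MemorySegment blocks, int index) {
        int within = index % 256;              // QK_K = 256 weights per super-block
        int half = within / 128;               // the block is handled in two 128-weight halves
        int l = within % 32;
        int group = (within % 128) / 32;       // 0..3: selects nibble, high-bit pair and sub-scale

        long base = (index / 256) * 210L;      // byte offset of this super-block
        int ql = blocks.get(ValueLayout.JAVA_BYTE, base + half * 64L + (group & 1) * 32L + l) & 0xFF;
        int qh = blocks.get(ValueLayout.JAVA_BYTE, base + 128 + half * 32L + l) & 0xFF;
        byte scale = blocks.get(ValueLayout.JAVA_BYTE, base + 192 + half * 8L + group * 2L + l / 16);
        float d = Float.float16ToFloat(blocks.get(ValueLayout.JAVA_SHORT_UNALIGNED, base + 208));

        // Recombine the 6-bit quant from its low nibble and high bit pair, then recenter by -32.
        int q = ((((qh >>> (group * 2)) & 0x03) << 4) | ((ql >>> ((group >> 1) * 4)) & 0x0F)) - 32;
        return d * scale * q;
    }
}
```

Benchmark setup and results: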
- Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q6_K
- CPU: AMD Ryzen 9 7900X
- JVM: OpenJDK 64-Bit Server VM
- Linux: 6.9.7-arch1-1
| Quant | Vector species | Speed |
|---|---|---|
| Q6_K | S_128_BIT | 0.22 tokens/s |
| Q6_K | S_256_BIT (non-array) | 0.47 tokens/s, 0.10 tokens/s |
| Q6_K | S_256_BIT (array) | 1.26 tokens/s |
| Q6_K | S_256_BIT (512 bits) | 0.29 tokens/s |
- Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q8_0
| Quant | Vector species | Speed |
|---|---|---|
| Q8_0 | S_128_BIT | 4.02 tokens/s |
| Q8_0 | S_256_BIT | 5.80 tokens/s |
I experimented with running this on a patched Graal compiler with partial Vector API support. I focused on vectorDot256 because it is the most likely to be compiled properly. I got quite far: everything compiles correctly until the last large block with the sums, where I hit an exception in the compiler. The bug seems to be in the compiler's internal tracking of the vectors, not in missing features. I believe that, with minor fixes, Graal will be able to compile this properly. I'll keep you posted.
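For context on what the compiler is being asked to handle, here is a minimal sketch of a 256-bit Vector API dot-product loop over plain float arrays, showing the multiply-accumulate plus final lane-reduction pattern; it is only an illustration of the technique, not the actual vectorDot256, which operates on quantized Q6_K blocks.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Minimal sketch of a 256-bit dot product with the Vector API
// (run with --add-modules jdk.incubator.vector).
final class Dot256Sketch {
    static final VectorSpecies<Float> S_256 = FloatVector.SPECIES_256;

    static float dot(float[] a, float[] b, int offset, int size) {
        FloatVector acc = FloatVector.zero(S_256);
        int upper = S_256.loopBound(size);
        int i = 0;
        for (; i < upper; i += S_256.length()) {
            FloatVector va = FloatVector.fromArray(S_256, a, offset + i);
            FloatVector vb = FloatVector.fromArray(S_256, b, offset + i);
            acc = va.fma(vb, acc);                        // multiply-accumulate in 8 float lanes
        }
        float sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum of the accumulator
        for (; i < size; i++) {                           // scalar tail for the remainder
            sum += a[offset + i] * b[offset + i];
        }
        return sum;
    }
}
```

The horizontal reduction at the end presumably corresponds to the "block with the sums" mentioned above.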
Did you try vectorDot256Array?