A faster version for Q4_1 x Q8_0 dot products
The idea behind this change is that Q8_0-quantized
values get used many times in the matrix multiplications where they are involved. In the current implementation, every time we evaluate a dot product we need to compute the sum of the quants in the Q8_0 vector, so the same operation is repeated many times. This makes the Q4_1 * Q8_0 dot product significantly slower than Q4_0 * Q8_0 (by ~80%).
In this PR the sum of the Q8_0 quants is computed during quantization and stored in the
now-modified block_q8_0 struct. It is then reused in the subsequent dot products.
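To illustrate the idea, here is a rough scalar sketch; the struct layouts, the field name s, and the nibble ordering are assumptions for illustration, not the PR's actual code. Since a Q4_1 value is d0*q4 + m0 and a Q8_0 value is d1*q8, the per-block dot product splits into d0*d1*sum(q4*q8) + m0*d1*sum(q8), and the second sum depends only on the Q8_0 block, so it can be computed once at quantization time:

#include <stdint.h>

#define QK 32

typedef struct {
    float   d;         // scale
    float   m;         // min
    uint8_t qs[QK/2];  // 4-bit quants, two per byte
} block_q4_1;

typedef struct {
    float   d;       // scale
    float   s;       // precomputed sum of the 32 quants (hypothetical field name)
    int8_t  qs[QK];  // 8-bit quants
} block_q8_0;

// Dot product over one block pair (illustrative quant pairing).
static float dot_block_q4_1_q8_0(const block_q4_1 *x, const block_q8_0 *y) {
    int sumi = 0;
    for (int j = 0; j < QK/2; ++j) {
        const uint8_t v = x->qs[j];
        sumi += (v & 0x0F) * y->qs[2*j + 0];  // low nibble
        sumi += (v >>   4) * y->qs[2*j + 1];  // high nibble
    }
    // d0*d1*sum(q4*q8) + m0*d1*sum(q8), with sum(q8) read from the struct
    return x->d * y->d * sumi + x->m * y->d * y->s;
}

The point is that s is filled once when the activations are quantized to Q8_0, instead of being re-summed inside every row dot product that touches the block.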
In a synthetic benchmark (just computing a bunch of dot products, see q8dot.cpp), this change speeds up the Q4_1 * Q8_0 dot product by 80%, making its performance identical to Q4_0 * Q8_0.
In practical use, I see a ~15% gain in token-prediction speed on the M2 and a ~5% gain on the Ryzen 7950X. The speed gain in prompt evaluation is much bigger (around 50%).
I have only done the change for the scalar version, ARM_NEON, and AVX2, so we still need an AVX implementation.
Here are some results on M1 Pro:
Using 4 threads:
# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 4
# master
llama_print_timings: sample time = 56.76 ms / 64 runs ( 0.89 ms per run)
llama_print_timings: prompt eval time = 844.74 ms / 8 tokens ( 105.59 ms per token)
llama_print_timings: eval time = 5959.12 ms / 63 runs ( 94.59 ms per run)
llama_print_timings: total time = 6870.81 ms
# faster_q41_q80_dot_product
llama_print_timings: sample time = 46.55 ms / 64 runs ( 0.73 ms per run)
llama_print_timings: prompt eval time = 547.04 ms / 8 tokens ( 68.38 ms per token)
llama_print_timings: eval time = 3842.15 ms / 63 runs ( 60.99 ms per run)
llama_print_timings: total time = 4445.57 ms
Using 8 threads:
# command
make -j && ./main -m ./models/7B/ggml-model-q4_1.bin -p "I believe the meaning of life is" -c 2048 -n 512 -t 8 --ignore-eos -s 3 -n 64 -t 8
# master
llama_print_timings: sample time = 56.65 ms / 64 runs ( 0.89 ms per run)
llama_print_timings: prompt eval time = 521.86 ms / 8 tokens ( 65.23 ms per token)
llama_print_timings: eval time = 3471.47 ms / 63 runs ( 55.10 ms per run)
llama_print_timings: total time = 4060.30 ms
# faster_q41_q80_dot_product
llama_print_timings: sample time = 46.56 ms / 64 runs ( 0.73 ms per run)
llama_print_timings: prompt eval time = 362.70 ms / 8 tokens ( 45.34 ms per token)
llama_print_timings: eval time = 3416.20 ms / 63 runs ( 54.23 ms per run)
llama_print_timings: total time = 3835.39 ms
At 4 threads the performance gain is much more pronounced, even for token eval time.
The prompt eval time is indeed significantly faster. This is best measured with a large prompt and LLAMA_NO_ACCELERATE=1 to avoid offloading to the AMX coprocessor.
Here on AVX2 / 4 cores this is looking good: master 232ms/token, your PR 223ms/token. Prompt eval seems to improve more, as you said, but I haven't looked at that closely.
But please clean up the commented-out code.
// There is not better way of doing this???
Horizontal sums really aren't what AVX is good at; I couldn't think of anything better. For anyone looking into this, here are two decent Stack Overflow threads (a sketch of the reduce-to-128-bits pattern they suggest follows the links):
- https://stackoverflow.com/questions/60108658/fastest-method-to-calculate-sum-of-all-packed-32-bit-integers-using-avx512-or-av
- https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction
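For reference, here is a minimal sketch of the reduce-to-128-bits pattern those threads suggest (the function name hsum_epi32_avx2 is mine, not code from this PR):

#include <immintrin.h>

// Reduce the eight 32-bit sums to 128 bits first, then finish with 128-bit shuffles.
static inline int hsum_epi32_avx2(__m256i a) {
    __m128i lo  = _mm256_castsi256_si128(a);
    __m128i hi  = _mm256_extracti128_si256(a, 1);
    __m128i sum = _mm_add_epi32(lo, hi);                     // 8 -> 4 partial sums
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4E));  // 4 -> 2 (swap 64-bit halves)
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xB1));  // 2 -> 1 (swap 32-bit pairs)
    return _mm_cvtsi128_si32(sum);
}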
Horizontal sums
This is slightly faster on my Alder Lake CPU. Dunno if it's faster in general.
#include <immintrin.h>

// Horizontal sum of the eight 32-bit integers in an AVX2 register.
static inline float horizontalSum(__m256i a) {
    // duplicate the odd 32-bit elements onto the even positions
    __m256i b = _mm256_castps_si256(_mm256_movehdup_ps(_mm256_castsi256_ps(a)));
    __m256i sum = _mm256_add_epi32(a, b);          // pairwise sums land in the even elements
    __m256i hi = _mm256_unpackhi_epi64(sum, sum);  // upper 64 bits of each 128-bit lane
    sum = _mm256_add_epi32(sum, hi);               // per-lane totals in elements 0 and 4
    return _mm256_cvtsi256_si32(sum) + _mm256_extract_epi32(sum, 4);
}