
ggml: optimize some vec dot functions for LoongArch ASX

Open MQ-mengqing opened this issue 1 year ago • 2 comments

  • ggml : optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX

  • ggml : optimize mul_sum_i8_pairs_float for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

  • ggml : optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

MQ-mengqing avatar Feb 13 '25 07:02 MQ-mengqing

I got the GGUF files from https://huggingface.co/MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF, and llama-bench shows the following on my 3A6000 @ 2.5 GHz running AOSC OS:

$ llama-bench -m Llama-3.2-1B-Instruct.Q2_K.gguf \
              -m Llama-3.2-1B-Instruct.Q3_K_S.gguf \
              -m Llama-3.2-1B-Instruct.Q4_K_S.gguf \
              -m Llama-3.2-1B-Instruct.Q5_K_S.gguf \
              -m Llama-3.2-1B-Instruct.Q6_K.gguf \
              -m Llama-3.2-1B-Instruct.Q8_0.gguf \
              -m Llama-3.2-1B-Instruct.IQ4_XS.gguf
Before,
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         33.15 ± 0.01 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         28.61 ± 0.14 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         30.04 ± 0.00 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         23.49 ± 0.05 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         31.33 ± 0.00 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         22.41 ± 0.05 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         pp512 |         27.76 ± 0.01 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         tg128 |         20.27 ± 0.03 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         pp512 |         27.51 ± 0.00 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         tg128 |         22.98 ± 0.03 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         pp512 |         35.64 ± 0.01 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         tg128 |         22.16 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         24.48 ± 0.00 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         18.93 ± 0.02 |


After,

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.45 ± 0.01 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         33.49 ± 0.01 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         36.53 ± 0.01 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         27.26 ± 0.11 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.34 ± 0.02 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         25.51 ± 0.06 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         pp512 |         34.06 ± 0.02 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         tg128 |         23.37 ± 0.03 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         pp512 |         37.92 ± 0.01 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         tg128 |         27.72 ± 0.03 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         pp512 |         37.25 ± 0.01 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         tg128 |         22.17 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         33.47 ± 0.01 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         23.46 ± 0.25 |

MQ-mengqing avatar Feb 13 '25 08:02 MQ-mengqing

cc @junchao-loongson for review

ggerganov avatar Feb 13 '25 13:02 ggerganov

  • benchmark

CPU: 3A6000 @ 2.5 GHz, OS: Deepin 23, GCC: 14.2.0

junchao@junchao-PC ~/work/ai/llama.cpp [±pr11842]
$ ./build/bin/llama-bench -m ../model-gguf/Llama-3.2-1B-Instruct.Q2_K.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q3_K_S.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q4_K_S.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q5_K_S.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q6_K.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.Q8_0.gguf \
              -m ../model-gguf/Llama-3.2-1B-Instruct.IQ4_XS.gguf
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.15 ± 0.06 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         36.28 ± 0.30 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         pp512 |         35.48 ± 0.01 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | CPU        |       8 |         tg128 |         33.67 ± 0.24 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.26 ± 0.05 |
| llama 1B Q4_K - Small          | 732.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         36.11 ± 0.05 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         pp512 |         34.25 ± 0.05 |
| llama 1B Q5_K - Small          | 843.75 MiB |     1.24 B | CPU        |       8 |         tg128 |         32.50 ± 0.24 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         pp512 |         38.05 ± 0.40 |
| llama 1B Q6_K                  | 967.00 MiB |     1.24 B | CPU        |       8 |         tg128 |         31.22 ± 0.22 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         pp512 |         37.29 ± 0.13 |
| llama 1B Q8_0                  |   1.22 GiB |     1.24 B | CPU        |       8 |         tg128 |         25.68 ± 0.17 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         pp512 |         33.65 ± 0.04 |
| llama 1B IQ4_XS - 4.25 bpw     | 701.25 MiB |     1.24 B | CPU        |       8 |         tg128 |         31.12 ± 0.22 |

build: 66ed5e38 (4712)

The benchmark results reproduce on my machine.

  • ctest
junchao@junchao-PC ~/work/ai/llama.cpp
$ bash ./ci/run.sh ./tmp/results ./tmp/mnt
.......
+ tee -a /home/junchao/work/ai/llama.cpp/tmp/results/ctest_release-ctest.log
+ ctest --output-on-failure -L main
Test project /home/junchao/work/ai/llama.cpp/build-ci-release
      Start  1: test-tokenizer-0-bert-bge
 1/28 Test  #1: test-tokenizer-0-bert-bge .........   Passed    0.03 sec
      Start  2: test-tokenizer-0-command-r
 2/28 Test  #2: test-tokenizer-0-command-r ........   Passed    0.60 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/28 Test  #3: test-tokenizer-0-deepseek-coder ...   Passed    0.07 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/28 Test  #4: test-tokenizer-0-deepseek-llm .....   Passed    0.20 sec
      Start  5: test-tokenizer-0-falcon
 5/28 Test  #5: test-tokenizer-0-falcon ...........   Passed    0.11 sec
      Start  6: test-tokenizer-0-gpt-2
 6/28 Test  #6: test-tokenizer-0-gpt-2 ............   Passed    0.09 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/28 Test  #7: test-tokenizer-0-llama-bpe ........   Passed    0.35 sec
      Start  8: test-tokenizer-0-llama-spm
 8/28 Test  #8: test-tokenizer-0-llama-spm ........   Passed    0.04 sec
      Start  9: test-tokenizer-0-mpt
 9/28 Test  #9: test-tokenizer-0-mpt ..............   Passed    0.09 sec
      Start 10: test-tokenizer-0-phi-3
10/28 Test #10: test-tokenizer-0-phi-3 ............   Passed    0.04 sec
      Start 11: test-tokenizer-0-qwen2
11/28 Test #11: test-tokenizer-0-qwen2 ............   Passed    0.32 sec
      Start 12: test-tokenizer-0-refact
12/28 Test #12: test-tokenizer-0-refact ...........   Passed    0.09 sec
      Start 13: test-tokenizer-0-starcoder
13/28 Test #13: test-tokenizer-0-starcoder ........   Passed    0.09 sec
      Start 14: test-sampling
14/28 Test #14: test-sampling .....................   Passed    1.31 sec
      Start 15: test-grammar-parser
15/28 Test #15: test-grammar-parser ...............   Passed    0.00 sec
      Start 16: test-grammar-integration
16/28 Test #16: test-grammar-integration ..........   Passed    0.01 sec
      Start 17: test-llama-grammar
17/28 Test #17: test-llama-grammar ................   Passed    0.00 sec
      Start 18: test-chat
18/28 Test #18: test-chat .........................   Passed    0.67 sec
      Start 19: test-tokenizer-1-llama-spm
19/28 Test #19: test-tokenizer-1-llama-spm ........   Passed    0.28 sec
      Start 20: test-log
20/28 Test #20: test-log ..........................   Passed    0.02 sec
      Start 21: test-arg-parser
21/28 Test #21: test-arg-parser ...................   Passed    0.06 sec
      Start 22: test-chat-template
22/28 Test #22: test-chat-template ................   Passed    0.13 sec
      Start 23: test-gguf
23/28 Test #23: test-gguf .........................   Passed    0.16 sec
      Start 24: test-backend-ops
24/28 Test #24: test-backend-ops ..................   Passed    0.01 sec
      Start 27: test-barrier
25/28 Test #27: test-barrier ......................   Passed    0.29 sec
      Start 28: test-quantize-fns
26/28 Test #28: test-quantize-fns .................   Passed   17.78 sec
      Start 29: test-quantize-perf
27/28 Test #29: test-quantize-perf ................   Passed    0.07 sec
      Start 30: test-rope
28/28 Test #30: test-rope .........................   Passed    0.15 sec

100% tests passed, 0 tests failed out of 28

Label Time Summary:
main    =  23.05 sec*proc (28 tests)

Total Test time (real) =  23.06 sec
.....

All ctest tests pass.

LGTM!

junchao-loongson avatar Feb 14 '25 08:02 junchao-loongson