ggml: optimize some vec dot functions for LoongArch ASX

- ggml : optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
- ggml : optimize mul_sum_i8_pairs_float for LoongArch ASX
- ggml : optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX
- ggml : optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX
- ggml : optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX
- ggml : optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX
- ggml : optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX
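All of the kernels listed above share the same inner operation: a widening multiply-accumulate of signed 8-bit values, later scaled by the per-block quantization scales. As a scalar sketch of what the ASX code vectorizes (illustrative only, not the actual ggml implementation; `dot_i8` is a made-up name):

```c
#include <stdint.h>

/* Scalar reference for the int8 multiply-accumulate at the core of the
 * q*_K dot kernels: each product is widened to 32-bit before being
 * accumulated, so it cannot overflow. Illustrative sketch, not ggml source. */
static int32_t dot_i8(const int8_t *x, const int8_t *y, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += (int32_t) x[i] * (int32_t) y[i];
    }
    return acc;
}
```

The ASX versions perform this over 32 bytes at a time and keep the accumulators in vector registers, which is where the throughput gains measured below come from.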
I got the GGUF models from https://huggingface.co/MaziyarPanahi/Llama-3.2-1B-Instruct-GGUF. llama-bench results on my 3A6000 @ 2.5 GHz running AOSC OS:

```shell
$ llama-bench -m Llama-3.2-1B-Instruct.Q2_K.gguf \
    -m Llama-3.2-1B-Instruct.Q3_K_S.gguf \
    -m Llama-3.2-1B-Instruct.Q4_K_S.gguf \
    -m Llama-3.2-1B-Instruct.Q5_K_S.gguf \
    -m Llama-3.2-1B-Instruct.Q6_K.gguf \
    -m Llama-3.2-1B-Instruct.Q8_0.gguf \
    -m Llama-3.2-1B-Instruct.IQ4_XS.gguf
```
Before:
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium | 546.50 MiB | 1.24 B | CPU | 8 | pp512 | 33.15 ± 0.01 |
| llama 1B Q2_K - Medium | 546.50 MiB | 1.24 B | CPU | 8 | tg128 | 28.61 ± 0.14 |
| llama 1B Q3_K - Small | 604.50 MiB | 1.24 B | CPU | 8 | pp512 | 30.04 ± 0.00 |
| llama 1B Q3_K - Small | 604.50 MiB | 1.24 B | CPU | 8 | tg128 | 23.49 ± 0.05 |
| llama 1B Q4_K - Small | 732.25 MiB | 1.24 B | CPU | 8 | pp512 | 31.33 ± 0.00 |
| llama 1B Q4_K - Small | 732.25 MiB | 1.24 B | CPU | 8 | tg128 | 22.41 ± 0.05 |
| llama 1B Q5_K - Small | 843.75 MiB | 1.24 B | CPU | 8 | pp512 | 27.76 ± 0.01 |
| llama 1B Q5_K - Small | 843.75 MiB | 1.24 B | CPU | 8 | tg128 | 20.27 ± 0.03 |
| llama 1B Q6_K | 967.00 MiB | 1.24 B | CPU | 8 | pp512 | 27.51 ± 0.00 |
| llama 1B Q6_K | 967.00 MiB | 1.24 B | CPU | 8 | tg128 | 22.98 ± 0.03 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | CPU | 8 | pp512 | 35.64 ± 0.01 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | CPU | 8 | tg128 | 22.16 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw | 701.25 MiB | 1.24 B | CPU | 8 | pp512 | 24.48 ± 0.00 |
| llama 1B IQ4_XS - 4.25 bpw | 701.25 MiB | 1.24 B | CPU | 8 | tg128 | 18.93 ± 0.02 |
After:
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium | 546.50 MiB | 1.24 B | CPU | 8 | pp512 | 38.45 ± 0.01 |
| llama 1B Q2_K - Medium | 546.50 MiB | 1.24 B | CPU | 8 | tg128 | 33.49 ± 0.01 |
| llama 1B Q3_K - Small | 604.50 MiB | 1.24 B | CPU | 8 | pp512 | 36.53 ± 0.01 |
| llama 1B Q3_K - Small | 604.50 MiB | 1.24 B | CPU | 8 | tg128 | 27.26 ± 0.11 |
| llama 1B Q4_K - Small | 732.25 MiB | 1.24 B | CPU | 8 | pp512 | 38.34 ± 0.02 |
| llama 1B Q4_K - Small | 732.25 MiB | 1.24 B | CPU | 8 | tg128 | 25.51 ± 0.06 |
| llama 1B Q5_K - Small | 843.75 MiB | 1.24 B | CPU | 8 | pp512 | 34.06 ± 0.02 |
| llama 1B Q5_K - Small | 843.75 MiB | 1.24 B | CPU | 8 | tg128 | 23.37 ± 0.03 |
| llama 1B Q6_K | 967.00 MiB | 1.24 B | CPU | 8 | pp512 | 37.92 ± 0.01 |
| llama 1B Q6_K | 967.00 MiB | 1.24 B | CPU | 8 | tg128 | 27.72 ± 0.03 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | CPU | 8 | pp512 | 37.25 ± 0.01 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | CPU | 8 | tg128 | 22.17 ± 0.03 |
| llama 1B IQ4_XS - 4.25 bpw | 701.25 MiB | 1.24 B | CPU | 8 | pp512 | 33.47 ± 0.01 |
| llama 1B IQ4_XS - 4.25 bpw | 701.25 MiB | 1.24 B | CPU | 8 | tg128 | 23.46 ± 0.25 |
cc @junchao-loongson for review
- benchmark

cpu: 3A6000 @ 2.5 GHz, os: Deepin 23, gcc: 14.2.0
```shell
junchao@junchao-PC ~/work/ai/llama.cpp [15:38:49]
> $ ./build/bin/llama-bench -m ../model-gguf/Llama-3.2-1B-Instruct.Q2_K.gguf \ [±pr11842]
    -m ../model-gguf/Llama-3.2-1B-Instruct.Q3_K_S.gguf \
    -m ../model-gguf/Llama-3.2-1B-Instruct.Q4_K_S.gguf \
    -m ../model-gguf/Llama-3.2-1B-Instruct.Q5_K_S.gguf \
    -m ../model-gguf/Llama-3.2-1B-Instruct.Q6_K.gguf \
    -m ../model-gguf/Llama-3.2-1B-Instruct.Q8_0.gguf \
    -m ../model-gguf/Llama-3.2-1B-Instruct.IQ4_XS.gguf
```
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q2_K - Medium | 546.50 MiB | 1.24 B | CPU | 8 | pp512 | 38.15 ± 0.06 |
| llama 1B Q2_K - Medium | 546.50 MiB | 1.24 B | CPU | 8 | tg128 | 36.28 ± 0.30 |
| llama 1B Q3_K - Small | 604.50 MiB | 1.24 B | CPU | 8 | pp512 | 35.48 ± 0.01 |
| llama 1B Q3_K - Small | 604.50 MiB | 1.24 B | CPU | 8 | tg128 | 33.67 ± 0.24 |
| llama 1B Q4_K - Small | 732.25 MiB | 1.24 B | CPU | 8 | pp512 | 38.26 ± 0.05 |
| llama 1B Q4_K - Small | 732.25 MiB | 1.24 B | CPU | 8 | tg128 | 36.11 ± 0.05 |
| llama 1B Q5_K - Small | 843.75 MiB | 1.24 B | CPU | 8 | pp512 | 34.25 ± 0.05 |
| llama 1B Q5_K - Small | 843.75 MiB | 1.24 B | CPU | 8 | tg128 | 32.50 ± 0.24 |
| llama 1B Q6_K | 967.00 MiB | 1.24 B | CPU | 8 | pp512 | 38.05 ± 0.40 |
| llama 1B Q6_K | 967.00 MiB | 1.24 B | CPU | 8 | tg128 | 31.22 ± 0.22 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | CPU | 8 | pp512 | 37.29 ± 0.13 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | CPU | 8 | tg128 | 25.68 ± 0.17 |
| llama 1B IQ4_XS - 4.25 bpw | 701.25 MiB | 1.24 B | CPU | 8 | pp512 | 33.65 ± 0.04 |
| llama 1B IQ4_XS - 4.25 bpw | 701.25 MiB | 1.24 B | CPU | 8 | tg128 | 31.12 ± 0.22 |
build: 66ed5e38 (4712)
The benchmark results reproduce on my machine.
- ctest
```shell
junchao@junchao-PC ~/work/ai/llama.cpp [14:53:41]
> $ bash ./ci/run.sh ./tmp/results ./tmp/mnt
.......
+ tee -a /home/junchao/work/ai/llama.cpp/tmp/results/ctest_release-ctest.log
+ ctest --output-on-failure -L main
Test project /home/junchao/work/ai/llama.cpp/build-ci-release
      Start  1: test-tokenizer-0-bert-bge
 1/28 Test  #1: test-tokenizer-0-bert-bge ......... Passed 0.03 sec
      Start  2: test-tokenizer-0-command-r
 2/28 Test  #2: test-tokenizer-0-command-r ........ Passed 0.60 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/28 Test  #3: test-tokenizer-0-deepseek-coder ... Passed 0.07 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/28 Test  #4: test-tokenizer-0-deepseek-llm ..... Passed 0.20 sec
      Start  5: test-tokenizer-0-falcon
 5/28 Test  #5: test-tokenizer-0-falcon ........... Passed 0.11 sec
      Start  6: test-tokenizer-0-gpt-2
 6/28 Test  #6: test-tokenizer-0-gpt-2 ............ Passed 0.09 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/28 Test  #7: test-tokenizer-0-llama-bpe ........ Passed 0.35 sec
      Start  8: test-tokenizer-0-llama-spm
 8/28 Test  #8: test-tokenizer-0-llama-spm ........ Passed 0.04 sec
      Start  9: test-tokenizer-0-mpt
 9/28 Test  #9: test-tokenizer-0-mpt .............. Passed 0.09 sec
      Start 10: test-tokenizer-0-phi-3
10/28 Test #10: test-tokenizer-0-phi-3 ............ Passed 0.04 sec
      Start 11: test-tokenizer-0-qwen2
11/28 Test #11: test-tokenizer-0-qwen2 ............ Passed 0.32 sec
      Start 12: test-tokenizer-0-refact
12/28 Test #12: test-tokenizer-0-refact ........... Passed 0.09 sec
      Start 13: test-tokenizer-0-starcoder
13/28 Test #13: test-tokenizer-0-starcoder ........ Passed 0.09 sec
      Start 14: test-sampling
14/28 Test #14: test-sampling ..................... Passed 1.31 sec
      Start 15: test-grammar-parser
15/28 Test #15: test-grammar-parser ............... Passed 0.00 sec
      Start 16: test-grammar-integration
16/28 Test #16: test-grammar-integration .......... Passed 0.01 sec
      Start 17: test-llama-grammar
17/28 Test #17: test-llama-grammar ................ Passed 0.00 sec
      Start 18: test-chat
18/28 Test #18: test-chat ......................... Passed 0.67 sec
      Start 19: test-tokenizer-1-llama-spm
19/28 Test #19: test-tokenizer-1-llama-spm ........ Passed 0.28 sec
      Start 20: test-log
20/28 Test #20: test-log .......................... Passed 0.02 sec
      Start 21: test-arg-parser
21/28 Test #21: test-arg-parser ................... Passed 0.06 sec
      Start 22: test-chat-template
22/28 Test #22: test-chat-template ................ Passed 0.13 sec
      Start 23: test-gguf
23/28 Test #23: test-gguf ......................... Passed 0.16 sec
      Start 24: test-backend-ops
24/28 Test #24: test-backend-ops .................. Passed 0.01 sec
      Start 27: test-barrier
25/28 Test #27: test-barrier ...................... Passed 0.29 sec
      Start 28: test-quantize-fns
26/28 Test #28: test-quantize-fns ................. Passed 17.78 sec
      Start 29: test-quantize-perf
27/28 Test #29: test-quantize-perf ................ Passed 0.07 sec
      Start 30: test-rope
28/28 Test #30: test-rope ......................... Passed 0.15 sec

100% tests passed, 0 tests failed out of 28

Label Time Summary:
main = 23.05 sec*proc (28 tests)

Total Test time (real) = 23.06 sec
.....
```
All ctest tests pass.
LGTM!