opencl: add kernel to handle mat mul in attention to improve encoding speed
This PR adds a new kernel that specifically handles the matrix multiplication in attention. This should improve encoding (prompt processing) performance for most models.
On X Elite (X1-85),
master
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp512 | 399.49 ± 1.87 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp1024 | 304.23 ± 2.80 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp2048 | 209.09 ± 0.26 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | tg256 | 33.85 ± 0.08 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp512 | 217.37 ± 0.95 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp1024 | 168.51 ± 0.40 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp2048 | 117.27 ± 0.32 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | tg256 | 20.86 ± 0.23 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp512 | 103.78 ± 0.28 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp1024 | 80.74 ± 0.58 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp2048 | 56.83 ± 0.09 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | tg256 | 12.82 ± 0.02 |
this PR,
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp512 | 658.22 ± 4.60 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp1024 | 613.10 ± 3.58 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp2048 | 525.78 ± 1.90 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | tg256 | 33.74 ± 0.03 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp512 | 342.08 ± 0.66 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp1024 | 317.64 ± 1.18 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp2048 | 274.53 ± 0.61 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | tg256 | 21.08 ± 0.07 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp512 | 158.72 ± 0.54 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp1024 | 138.81 ± 0.17 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp2048 | 114.03 ± 0.10 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | tg256 | 12.81 ± 0.06 |
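For a quick read of the two X Elite tables, the pp speedups can be computed directly from the reported means. The snippet below is only a sketch: the dictionary layout and the `speedups` helper are illustrative names, not part of llama.cpp, and the numbers are copied from the tables above (the ± spread is ignored). With these values the ratios come out to roughly 1.5x to 2.5x, growing with prompt length.

```python
# Mean prompt-processing throughput (t/s) copied from the X Elite tables above.
# Keys are (model, test); the layout is an illustrative choice.
master = {
    ("qwen2 1.5B", "pp512"): 399.49,
    ("qwen2 1.5B", "pp1024"): 304.23,
    ("qwen2 1.5B", "pp2048"): 209.09,
    ("qwen2 3B", "pp512"): 217.37,
    ("qwen2 3B", "pp1024"): 168.51,
    ("qwen2 3B", "pp2048"): 117.27,
    ("qwen3 8B", "pp512"): 103.78,
    ("qwen3 8B", "pp1024"): 80.74,
    ("qwen3 8B", "pp2048"): 56.83,
}
this_pr = {
    ("qwen2 1.5B", "pp512"): 658.22,
    ("qwen2 1.5B", "pp1024"): 613.10,
    ("qwen2 1.5B", "pp2048"): 525.78,
    ("qwen2 3B", "pp512"): 342.08,
    ("qwen2 3B", "pp1024"): 317.64,
    ("qwen2 3B", "pp2048"): 274.53,
    ("qwen3 8B", "pp512"): 158.72,
    ("qwen3 8B", "pp1024"): 138.81,
    ("qwen3 8B", "pp2048"): 114.03,
}


def speedups(before, after):
    """Return after/before throughput ratio for every key in `before`."""
    return {key: after[key] / before[key] for key in before}


if __name__ == "__main__":
    for (model, test), ratio in speedups(master, this_pr).items():
        print(f"{model:11s} {test:7s} {ratio:.2f}x")
```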
@shaofeiqi @max-krasnyansky On 8GEN3, this PR decreases decoding performance.
Without this PR:
| PP | TG | B | repeat | N_KV | t_tg ms | e2e ms | TTFT ms | TPOT ms | TPS(pp) t/s | TPS(tg) t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| 512 | 64 | 1 | 5 | 576 | 2888.96 | 5222.09 | 2333.12 | 45.86 | 219.45 | 22.88 |
With this PR:
| PP | TG | B | repeat | N_KV | t_tg ms | e2e ms | TTFT ms | TPOT ms | TPS(pp) t/s | TPS(tg) t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| 512 | 64 | 1 | 5 | 576 | 3744.50 | 5014.40 | 1269.90 | 59.44 | 403.19 | 17.68 |
Test command:
LD_LIBRARY_PATH=./lib ./bin/llama-batched-bench -m ../Qwen3_0.6B_Q4_0.gguf -c 2304 -b 2048 -npp 512 -ntg 64 -npl 1 -ngl 99 --flash-attn off
@lippman1125 I suppose you are referring to tg. This PR should not affect tg.
On master, using 8Gen3,
ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
without this change (manually comment out lines 6897 - 6902),
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | pp512 | 349.43 ± 1.44 |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | tg64 | 28.65 ± 2.84 |
build: 7d77f0732 (7108)
with this change,
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | pp512 | 798.45 ± 8.33 |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | tg64 | 28.48 ± 1.21 |
build: 7d77f0732 (7108)
tg numbers seem about the same for my setup.
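As a rough sanity check on "about the same", one can compare the difference of the two means against their combined spread. This is a sketch under assumptions: it treats the ± values as standard deviations across repetitions and combines them in quadrature, which is a heuristic rather than a proper significance test. `within_noise` is a hypothetical helper, not a llama-bench feature.

```python
import math


def within_noise(mean_a, std_a, mean_b, std_b):
    """True if the two means differ by no more than their combined spread.

    Assumes std_a and std_b are standard deviations; the quadrature sum is
    only a rough yardstick for run-to-run noise.
    """
    combined = math.hypot(std_a, std_b)
    return abs(mean_a - mean_b) <= combined


if __name__ == "__main__":
    # tg64 on Adreno 750: without vs. with the change (values from the tables above).
    print("tg64 within noise:", within_noise(28.65, 2.84, 28.48, 1.21))
    # pp512 for contrast: the gap is far larger than the spread.
    print("pp512 within noise:", within_noise(349.43, 1.44, 798.45, 8.33))
```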
@lhez Thanks for your reply. I verified it again and there is no problem. Good job!