
opencl: add kernel to handle mat mul in attention to improve encoding speed

shaofeiqi opened this pull request 5 months ago • 1 comment

This PR adds a new kernel to specifically handle the matrix multiply in attention. This should improve encoding performance for most models.
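For context, the operation such a kernel accelerates is the pair of matrix multiplies inside scaled dot-product attention. The sketch below is not the PR's OpenCL kernel; it is a minimal pure-Python illustration (shapes and names are illustrative) of what is being computed, and of why a prompt-processing kernel should not change token-generation speed:

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q: n_q x d, K: n_kv x d, V: n_kv x d (plain lists of lists).
    During prompt processing (pp) n_q is the whole batch of prompt
    tokens, so these two matmuls dominate encoding time; during
    token generation (tg) n_q == 1, which is why a pp-oriented
    kernel is not expected to affect tg numbers.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # scores = scale * (Q @ K^T) -- the first matmul
    scores = [[scale * sum(q[i] * k[i] for i in range(d)) for k in K] for q in Q]
    # row-wise softmax over the key dimension
    probs = []
    for row in scores:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        s = sum(e)
        probs.append([x / s for x in e])
    # out = probs @ V -- the second matmul
    return [[sum(p[j] * V[j][i] for j in range(len(V)))
             for i in range(d)] for p in probs]
```

Note the attention cost grows with both the query and key counts, so the matmul work scales roughly quadratically with prompt length, matching the larger gains at pp2048 in the tables below.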

shaofeiqi avatar Nov 11 '25 22:11 shaofeiqi

On X Elite (X1-85),

master

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp512 | 399.49 ± 1.87 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp1024 | 304.23 ± 2.80 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp2048 | 209.09 ± 0.26 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | tg256 | 33.85 ± 0.08 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp512 | 217.37 ± 0.95 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp1024 | 168.51 ± 0.40 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp2048 | 117.27 ± 0.32 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | tg256 | 20.86 ± 0.23 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp512 | 103.78 ± 0.28 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp1024 | 80.74 ± 0.58 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp2048 | 56.83 ± 0.09 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | tg256 | 12.82 ± 0.02 |

this PR,

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp512 | 658.22 ± 4.60 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp1024 | 613.10 ± 3.58 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | pp2048 | 525.78 ± 1.90 |
| qwen2 1.5B Q4_0 | 828.59 MiB | 1.54 B | OpenCL | 99 | tg256 | 33.74 ± 0.03 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp512 | 342.08 ± 0.66 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp1024 | 317.64 ± 1.18 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | pp2048 | 274.53 ± 0.61 |
| qwen2 3B Q4_0 | 1.62 GiB | 3.09 B | OpenCL | 99 | tg256 | 21.08 ± 0.07 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp512 | 158.72 ± 0.54 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp1024 | 138.81 ± 0.17 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | pp2048 | 114.03 ± 0.10 |
| qwen3 8B Q4_0 | 4.29 GiB | 8.19 B | OpenCL | 99 | tg256 | 12.81 ± 0.06 |
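To make the gain concrete, here is a quick calculation over the qwen2 1.5B rows of the two tables (numbers copied from above). The speedup grows with prompt length, while tg is unchanged within noise:

```python
# Throughput (t/s) for qwen2 1.5B Q4_0, taken from the tables above
master = {"pp512": 399.49, "pp1024": 304.23, "pp2048": 209.09, "tg256": 33.85}
pr     = {"pp512": 658.22, "pp1024": 613.10, "pp2048": 525.78, "tg256": 33.74}

# Speedup of this PR over master for each test
speedup = {k: pr[k] / master[k] for k in master}
for k, v in speedup.items():
    print(f"{k}: {v:.2f}x")
# pp512: 1.65x, pp1024: 2.02x, pp2048: 2.51x, tg256: 1.00x
```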

lhez avatar Nov 14 '25 19:11 lhez

@shaofeiqi @max-krasnyansky On 8Gen3, this PR decreases decoding performance.

Without this PR:

| PP | TG | B | repeat | N_KV | t_tg ms | e2e ms | TTFT ms | TPOT ms | TPS(pp) t/s | TPS(tg) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 64 | 1 | 5 | 576 | 2888.96 | 5222.09 | 2333.12 | 45.86 | 219.45 | 22.88 |

With this PR:

| PP | TG | B | repeat | N_KV | t_tg ms | e2e ms | TTFT ms | TPOT ms | TPS(pp) t/s | TPS(tg) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 64 | 1 | 5 | 576 | 3744.50 | 5014.40 | 1269.90 | 59.44 | 403.19 | 17.68 |

Test command: `LD_LIBRARY_PATH=./lib ./bin/llama-batched-bench -m ../Qwen3_0.6B_Q4_0.gguf -c 2304 -b 2048 -npp 512 -ntg 64 -npl 1 -ngl 99 --flash-attn off`
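For readers decoding the derived columns: under the assumed relationships below (TTFT covers the 512 prompt tokens, and TPOT excludes the first token), the raw timings in the "without this PR" row reproduce the reported figures:

```python
# Sanity-checking derived columns from the "without this PR" row above.
# These formulas are an assumption about how the columns relate,
# not taken from the llama-batched-bench source.
PP, TG = 512, 64
t_tg_ms, e2e_ms, ttft_ms = 2888.96, 5222.09, 2333.12

tps_pp = PP / (ttft_ms / 1000.0)   # prompt tokens over time-to-first-token
tpot_ms = t_tg_ms / (TG - 1)       # per-token decode latency, first token excluded
print(f"TPS(pp) = {tps_pp:.2f} t/s")              # table: 219.45
print(f"TPOT    = {tpot_ms:.2f} ms")              # table: 45.86
print(f"TTFT + t_tg = {ttft_ms + t_tg_ms:.2f}")   # table e2e: 5222.09
```

By these relations, the PR trades time-to-first-token (1269.90 ms vs 2333.12 ms, matching the faster prefill) against per-token decode latency (59.44 ms vs 45.86 ms), which is the decoding regression being reported.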

lippman1125 avatar Nov 18 '25 14:11 lippman1125

@lippman1125 I suppose you are referring to tg. This PR should not affect tg.

On master, using 8Gen3,

```
ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16
```

Without this change (lines 6897 - 6902 manually commented out):

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | pp512 | 349.43 ± 1.44 |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | tg64 | 28.65 ± 2.84 |

build: 7d77f0732 (7108)

With this change:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | pp512 | 798.45 ± 8.33 |
| qwen3 0.6B Q4_0 | 403.42 MiB | 751.63 M | OpenCL | 99 | tg64 | 28.48 ± 1.21 |

build: 7d77f0732 (7108)

tg numbers seem about the same for my setup.
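"About the same" can be checked against the reported noise: the difference in tg means is far smaller than the combined standard deviations of the two runs, so it is well within run-to-run variation:

```python
# tg64 results from the two runs above: (mean, stddev) in t/s
without_pr = (28.65, 2.84)
with_pr    = (28.48, 1.21)

diff = abs(without_pr[0] - with_pr[0])
# Combine the two stddevs in quadrature for a rough noise estimate
noise = (without_pr[1] ** 2 + with_pr[1] ** 2) ** 0.5
print(f"delta = {diff:.2f} t/s vs noise = {noise:.2f} t/s")
assert diff < noise  # difference is within measurement noise
```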

lhez avatar Nov 19 '25 21:11 lhez

@lhez Thanks for your reply. I verified it again and there is no problem. Good job!

lippman1125 avatar Nov 20 '25 11:11 lippman1125