llama.cpp opencl: add kernel to handle mat mul in attention to improve encoding speed

This PR adds a new kernel specifically for the matrix multiplication in attention. It should improve encoding (prompt processing) performance for most models.

Nov 11 '25 22:11 shaofeiqi

On X Elite (X1-85),

master

model	size	params	backend	ngl	test	t/s
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp512	399.49 ± 1.87
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp1024	304.23 ± 2.80
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp2048	209.09 ± 0.26
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	tg256	33.85 ± 0.08
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp512	217.37 ± 0.95
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp1024	168.51 ± 0.40
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp2048	117.27 ± 0.32
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	tg256	20.86 ± 0.23
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp512	103.78 ± 0.28
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp1024	80.74 ± 0.58
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp2048	56.83 ± 0.09
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	tg256	12.82 ± 0.02

this PR,

model	size	params	backend	ngl	test	t/s
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp512	658.22 ± 4.60
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp1024	613.10 ± 3.58
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	pp2048	525.78 ± 1.90
qwen2 1.5B Q4_0	828.59 MiB	1.54 B	OpenCL	99	tg256	33.74 ± 0.03
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp512	342.08 ± 0.66
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp1024	317.64 ± 1.18
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	pp2048	274.53 ± 0.61
qwen2 3B Q4_0	1.62 GiB	3.09 B	OpenCL	99	tg256	21.08 ± 0.07
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp512	158.72 ± 0.54
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp1024	138.81 ± 0.17
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	pp2048	114.03 ± 0.10
qwen3 8B Q4_0	4.29 GiB	8.19 B	OpenCL	99	tg256	12.81 ± 0.06
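
To make the two tables easier to compare, here is a minimal sketch that computes the ratio of the "this PR" throughput to the master throughput for each row. The t/s values are copied from the tables above; the `results` dictionary and the `speedup` helper are names chosen for illustration only and are not part of llama.cpp or llama-bench.

```python
# Mean throughput (t/s) copied from the two tables above, as (master, this PR).
# The ± spread is ignored here, so the ratios are only approximate.
results = {
    ("qwen2 1.5B", "pp512"): (399.49, 658.22),
    ("qwen2 1.5B", "pp1024"): (304.23, 613.10),
    ("qwen2 1.5B", "pp2048"): (209.09, 525.78),
    ("qwen2 1.5B", "tg256"): (33.85, 33.74),
    ("qwen2 3B", "pp512"): (217.37, 342.08),
    ("qwen2 3B", "pp1024"): (168.51, 317.64),
    ("qwen2 3B", "pp2048"): (117.27, 274.53),
    ("qwen2 3B", "tg256"): (20.86, 21.08),
    ("qwen3 8B", "pp512"): (103.78, 158.72),
    ("qwen3 8B", "pp1024"): (80.74, 138.81),
    ("qwen3 8B", "pp2048"): (56.83, 114.03),
    ("qwen3 8B", "tg256"): (12.82, 12.81),
}


def speedup(before: float, after: float) -> float:
    """Return the throughput ratio after / before (hypothetical helper)."""
    return after / before


if __name__ == "__main__":
    for (model, test), (before, after) in results.items():
        print(f"{model:<11} {test:<7} {speedup(before, after):.2f}x")
```

On these numbers the pp ratio grows with prompt length (for the 1.5B model, from roughly 1.6x at pp512 to roughly 2.5x at pp2048), while the tg256 rows stay within about 1% of master.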

Nov 14 '25 19:11 lhez

@shaofeiqi @max-krasnyansky On 8GEN3, this PR decreases decoding performance.

Without this PR:

PP	TG	B	repeat	N_KV	t_tg ms	e2e ms	TTFT ms	TPOT ms	TPS(pp) t/s	TPS(tg) t/s
512	64	1	5	576	2888.96	5222.09	2333.12	45.86	219.45	22.88

With this PR:

PP	TG	B	repeat	N_KV	t_tg ms	e2e ms	TTFT ms	TPOT ms	TPS(pp) t/s	TPS(tg) t/s
512	64	1	5	576	3744.50	5014.40	1269.90	59.44	403.19	17.68

Test command: LD_LIBRARY_PATH=./lib ./bin/llama-batched-bench -m ../Qwen3_0.6B_Q4_0.gguf -c 2304 -b 2048 -npp 512 -ntg 64 -npl 1 -ngl 99 --flash-attn off

Nov 18 '25 14:11 lippman1125

@lippman1125 I suppose you are referring to tg. This PR should not affect tg.

On master, using 8Gen3,

ggml_opencl: device: 'QUALCOMM Adreno(TM) 750 (OpenCL 3.0 Adreno(TM) 750)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.45.02.16

without this change (manually comment out lines 6897 - 6902),

model	size	params	backend	ngl	test	t/s
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	pp512	349.43 ± 1.44
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	tg64	28.65 ± 2.84

build: 7d77f0732 (7108)

with this change,

model	size	params	backend	ngl	test	t/s
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	pp512	798.45 ± 8.33
qwen3 0.6B Q4_0	403.42 MiB	751.63 M	OpenCL	99	tg64	28.48 ± 1.21

build: 7d77f0732 (7108)

tg numbers seem about the same for my setup.

Nov 19 '25 21:11 lhez

@lhez Thanks for your reply. I verified it again and there is no problem. Good job!

Nov 20 '25 11:11 lippman1125