
OpenCL: Performance comparison depending on gpu_offloads

Open sparkleholic opened this issue 1 year ago • 5 comments

I expected that offloading more layers to the GPU would yield better performance (tokens/sec), but the benchmark results showed otherwise.

The following was executed on a QCS8550 with this model: https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-GGUF/blob/main/EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf

llama-bench -m ./EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf  -ngl 0,5,10,15,20,31
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 740'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.20.00
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 256 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: A_q_d buffer size reduced from 311164928 to 268435456 due to device limitations.
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         18.92 ± 0.18 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          3.90 ± 0.10 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         16.97 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          3.37 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         16.23 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          3.12 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         15.87 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          2.93 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         15.22 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          2.80 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         13.81 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         tg128 |          2.95 ± 0.14 |

sparkleholic avatar Apr 08 '25 00:04 sparkleholic

@lhez, @max-krasnyansky I've checked the performance via llama-bench on QCS8550 with OpenCL enabled. The odd part is that offloading more layers to the GPU does not improve performance.

sparkleholic avatar Apr 08 '25 00:04 sparkleholic

Try Q4_0? I have tried Q4_K_M in the past and the performance was not good either. https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md

kizuna0487 avatar Apr 08 '25 02:04 kizuna0487

> Try Q4_0? I have tried Q4_K_M in the past and the performance was not good either. https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md

Thanks. I'm going to test with Q4_0 as well. I hadn't noticed that Adreno 740 (QCS8550) is missing from the verified device list, and that OPENCL.md does not mention Q4_K_M support either.

sparkleholic avatar Apr 08 '25 04:04 sparkleholic

@sparkleholic - currently Q4_0 is optimized, so you will need to use --pure when quantizing the model to Q4_0. Without --pure, some layers will be quantized as Q6_K, resulting in worse performance. Q4_K is not currently supported, so when you run Q4_K_M models, the Q4_K layers fall back to the CPU, resulting in even worse performance.

Adreno 740 should work just fine. Feel free to reply back if you see any issue with 740.
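
A sketch of the quantization step described above, assuming a local F16 GGUF of the model as input (the filenames and build path are placeholders; `llama-quantize` is built as part of llama.cpp):

```shell
# Quantize to pure Q4_0 so every tensor uses the format the OpenCL
# Adreno kernels are optimized for. Without --pure, llama-quantize
# produces a k-quant mixture (some tensors end up in Q6_K), which the
# OpenCL backend does not accelerate.
./build/bin/llama-quantize --pure \
    EXAONE-3.5-2.4B-Instruct-F16.gguf \
    EXAONE-3.5-2.4B-Instruct-Q4_0.gguf \
    Q4_0

# Then benchmark with full offload:
./build/bin/llama-bench -m EXAONE-3.5-2.4B-Instruct-Q4_0.gguf -ngl 31
```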

lhez avatar Apr 08 '25 04:04 lhez

@lhez, @kizuna0487 Thanks for the info. I've verified that Q4_0 works well on QCS8550 (Adreno 740): offloading more layers with -ngl now yields better benchmark results.

EXAONE-3.5-2.4B-Instruct-Q4_0.gguf

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         16.46 ± 0.23 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          5.29 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         17.98 ± 0.06 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          4.31 ± 0.09 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         21.16 ± 0.04 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          5.16 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         25.82 ± 0.05 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          6.20 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         33.12 ± 0.06 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          7.85 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         75.02 ± 0.03 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  31 |         tg128 |         11.85 ± 0.05 |
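
For a quick sanity check, the pp512 numbers from the two tables can be compared directly (values copied from the benchmark output above):

```python
# pp512 throughput (tokens/s) at ngl=0 vs ngl=31, from the tables above
q4km = {0: 18.92, 31: 13.81}  # Q4_K_M: Q4_K layers fall back to CPU
q4_0 = {0: 16.46, 31: 75.02}  # pure Q4_0: runs on the Adreno kernels

print(f"Q4_K_M ngl=31 vs ngl=0: {q4km[31] / q4km[0]:.2f}x")  # 0.73x (slower)
print(f"Q4_0   ngl=31 vs ngl=0: {q4_0[31] / q4_0[0]:.2f}x")  # 4.56x (faster)
```

So with Q4_K_M, full offload was about 27% slower than CPU-only, while with pure Q4_0 it is roughly 4.6x faster.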

sparkleholic avatar Apr 08 '25 09:04 sparkleholic

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 23 '25 01:05 github-actions[bot]