# Feature Request: Multi-core support for full GPU offload
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
When the entire model is offloaded to the GPU, llama.cpp uses only a single thread, regardless of the --threads argument.
On systems with low single-core performance, that one thread cannot issue work fast enough to keep the GPU fully utilized.
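A quick way to observe this (a sketch from my setup; any per-thread monitor works):

```sh
# Start a fully offloaded benchmark in the background, then watch per-thread CPU usage.
# Despite --threads 32, only one thread sits near 100% CPU.
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --threads 32 &
top -H -p "$(pgrep -n llama-bench)"
```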
I have noticed that using RPC on localhost increases my token generation speed by ~30%.
```sh
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf
```
Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | pp512 | 230.09 ± 0.07 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 37.14 ± 0.13 |
With rpc-server running on the same GPU:
```sh
HIP_VISIBLE_DEVICES=0 ./rpc-server -p 50052
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --rpc 0.0.0.0:50052
```
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA+RPC | 99 | pp512 | 231.96 ± 0.09 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA+RPC | 99 | tg128 | 48.40 ± 0.41 |
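The same loopback workaround should carry over to the other tools, since they take the same --rpc argument (untested sketch; the port is arbitrary):

```sh
# RPC server bound to the local GPU, as above...
HIP_VISIBLE_DEVICES=0 ./rpc-server -p 50052 &
# ...and llama-server routed through it instead of using the GPU directly.
./llama-server -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 99 --rpc 127.0.0.1:50052
```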
When even a single layer is kept on the CPU (-ngl 32 instead of full offload), llama-bench uses multiple threads and performance improves:
```sh
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 32 --threads 32
```
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 32 | 32 | pp512 | 231.93 ± 0.08 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 32 | 32 | tg128 | 46.72 ± 0.41 |
This problem limits multi-GPU performance as well: row split uses two threads, but two GPUs already peg those cores at 100%, and adding a third GPU reduces token generation speed.
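For reference, the multi-GPU observation comes from row-split runs of this shape (the device list is illustrative):

```sh
# Row split: each layer's weights are split across the listed devices,
# so every generated token exercises all GPUs at once.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -sm row
```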
### Motivation
Servers and older CPUs often have many cores but low boost clocks, so a single thread cannot reach full GPU utilization.
### Possible Implementation
Honor the --threads argument even when the model is fully offloaded to the GPU.
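In terms of user-facing behavior, the goal is simply that an invocation like the following actually uses the requested threads (both flags already exist; today --threads has no effect at full offload):

```sh
# Desired: full offload, with multiple CPU threads feeding the GPU backend.
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 99 --threads 8
```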