# Feature Request: Multi-core support for full GPU offload
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the README.md.
- [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [X] I reviewed the Discussions, and have a new and useful enhancement to share.
### Feature Description
When the entire model is offloaded to the GPU, llama.cpp uses only a single thread, regardless of the --threads argument.
On systems with low single-core performance, that one thread cannot issue work fast enough to keep the GPU fully utilized.
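A quick way to observe this (a sketch from my setup; any per-thread monitor works):

```sh
# Start a fully offloaded benchmark in the background, then watch per-thread CPU usage.
# Despite --threads 32, only one thread sits near 100% CPU.
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --threads 32 &
top -H -p "$(pgrep -n llama-bench)"
```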
I have noticed that using RPC on localhost increases my token generation speed by ~30%.
```sh
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf
```
Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | pp512 | 230.09 ± 0.07 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 99 | tg128 | 37.14 ± 0.13 |
With rpc-server running on the same GPU:
```sh
HIP_VISIBLE_DEVICES=0 ./rpc-server -p 50052
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --rpc 0.0.0.0:50052
```
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA+RPC | 99 | pp512 | 231.96 ± 0.09 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA+RPC | 99 | tg128 | 48.40 ± 0.41 |
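The same loopback workaround should carry over to the other tools, since they take the same --rpc argument (untested sketch; the port is arbitrary):

```sh
# RPC server bound to the local GPU, as above...
HIP_VISIBLE_DEVICES=0 ./rpc-server -p 50052 &
# ...and llama-server routed through it instead of using the GPU directly.
./llama-server -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 99 --rpc 127.0.0.1:50052
```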
When even a single layer is kept on the CPU (-ngl 32 instead of full offload), llama-bench uses multiple threads and performance improves:
```sh
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 32 --threads 32
```
| model | size | params | backend | ngl | threads | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 32 | 32 | pp512 | 231.93 ± 0.08 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA | 32 | 32 | tg128 | 46.72 ± 0.41 |
This problem limits multi-GPU performance as well: row split uses two threads, but two GPUs already peg those cores at 100%, and adding a third GPU reduces token generation speed.
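For reference, the multi-GPU observation comes from row-split runs of this shape (the device list is illustrative):

```sh
# Row split: each layer's weights are split across the listed devices,
# so every generated token exercises all GPUs at once.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -sm row
```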
### Motivation
Servers and older CPUs often have many cores but low boost clocks, so a single thread cannot reach full GPU utilization.
### Possible Implementation
Honor the --threads argument even when the model is fully offloaded to the GPU.
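In terms of user-facing behavior, the goal is simply that an invocation like the following actually uses the requested threads (both flags already exist; today --threads has no effect at full offload):

```sh
# Desired: full offload, with multiple CPU threads feeding the GPU backend.
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 99 --threads 8
```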