
Feature Request: Multi core support for full GPU offload

Open · 8XXD8 opened this issue 1 year ago · 7 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

When the entire model is offloaded to the GPU, llama.cpp uses only a single thread, regardless of the --threads argument. On systems with low single-core performance this holds back GPU utilization. I have noticed that using RPC on localhost increases my token generation speed by ~30%, presumably because the client and the rpc-server each contribute their own CPU thread.

```
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf
```

Device 0: AMD Radeon (TM) Pro VII, compute capability 9.0, VMM: no

| model         | size     | params | backend | ngl | test  | t/s           |
| ------------- | -------- | ------ | ------- | --- | ----- | ------------- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA    | 99  | pp512 | 230.09 ± 0.07 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA    | 99  | tg128 | 37.14 ± 0.13  |

With rpc-server running on the same GPU:

```
HIP_VISIBLE_DEVICES=0 ./rpc-server -p 50052
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf --rpc 0.0.0.0:50052
```

| model         | size     | params | backend  | ngl | test  | t/s           |
| ------------- | -------- | ------ | -------- | --- | ----- | ------------- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA+RPC | 99  | pp512 | 231.96 ± 0.09 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA+RPC | 99  | tg128 | 48.40 ± 0.41  |

When a single layer is kept on the CPU, llama-bench uses more threads, which increases performance:

```
HIP_VISIBLE_DEVICES=0 ./llama-bench -m /home/user/text-generation-webui/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -ngl 32 --threads 32
```

| model         | size     | params | backend | ngl | threads | test  | t/s           |
| ------------- | -------- | ------ | ------- | --- | ------- | ----- | ------------- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA    | 32  | 32      | pp512 | 231.93 ± 0.08 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | CUDA    | 32  | 32      | tg128 | 46.72 ± 0.41  |

This problem limits multi-GPU performance too: with row split, llama.cpp uses two threads, but two GPUs already peg those cores at 100%, and adding a third GPU reduces token generation speed.

Motivation

Servers and older CPUs have many cores but low boost clocks, and a single thread cannot drive the GPU to full utilization.

Possible Implementation

Enable the --threads argument for full GPU offload, so that more than one core can be used to feed the GPUs.
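
To make the idea concrete, here is a minimal, hypothetical C++ sketch (not actual llama.cpp code) of why multiple submission threads help when the per-op CPU cost is the bottleneck. `submit_op`, `eval_graph`, the 50 µs per-op cost, and the op count are all invented for illustration; a real implementation would also have to respect dependencies between graph ops, so only independent work (e.g., row-split pieces across GPUs) could be dispatched in parallel like this.

```cpp
// Hypothetical sketch, not llama.cpp code: models a pool of submission
// threads hiding the CPU-side cost of dispatching graph ops to a GPU.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the CPU work needed to prepare/launch one graph op.
// The 50 us figure is an arbitrary assumption for the demo.
static void submit_op(int /*op_id*/) {
    auto end = std::chrono::steady_clock::now() + std::chrono::microseconds(50);
    while (std::chrono::steady_clock::now() < end) { /* spin: CPU-bound work */ }
}

// Evaluate n_ops independent ops using n_threads submission threads;
// returns the wall-clock time in seconds.
static double eval_graph(int n_ops, int n_threads) {
    std::atomic<int> next{0};
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&] {
            // Each worker pulls the next op index until the graph is done.
            for (int i = next.fetch_add(1); i < n_ops; i = next.fetch_add(1)) {
                submit_op(i);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count();
}

int main() {
    const int n_ops = 2048; // assumed op count for one decode graph
    for (int n_threads : {1, 2, 4, 8}) {
        printf("threads=%d  eval time: %.3f s\n", n_threads, eval_graph(n_ops, n_threads));
    }
    return 0;
}
```

In this toy model the wall time drops roughly linearly with the thread count while the GPU-side work is unchanged, which is consistent with the observation above that even splitting submission across just two processes (llama-bench plus rpc-server on localhost) recovers ~30% of token generation speed.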

8XXD8 · Jul 25 '24 08:07