
Feature Request: Tensor Parallelism support

Open ClarkChin08 opened this issue 1 year ago • 3 comments

Prerequisites

  • [X] I am running the latest code. Mention the version if possible as well.
  • [X] I carefully followed the README.md.
  • [X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Tensor parallelism is a critical technique for training and running inference on very large language models: the actual computations/tensors are split across multiple compute devices.

Motivation

In our previous implementation on Xeon CPUs, tensor parallelism (TP) significantly reduced inference latency:

| model | precision | TP size | input_size | next_token_time (ms) |
|------------|-----------|---------|------------|----------------------|
| llama2-70b | q4_j | 1 | 32 | 191.91 |
| llama2-70b | q4_j | 2 | 32 | 120.87 |
| llama2-70b | q4_j | 4 | 32 | 86.15 |
| llama2-70b | q4_j | 1 | 1024 | 197.18 |
| llama2-70b | q4_j | 2 | 1024 | 129.25 |
| llama2-70b | q4_j | 4 | 1024 | 91.76 |
| llama2-70b | q4_j | 1 | 2012 | 204.85 |
| llama2-70b | q4_j | 2 | 2012 | 127.31 |
| llama2-70b | q4_j | 4 | 2012 | 100.44 |

Note: TP size = 1 means TP is not used.

Possible Implementation

In our TP implementation, we pre-split the corresponding weights, so the time spent on splitting is a one-time cost and does not affect inference performance. The other major factor impacting performance is 'all reduce': since each node computes only partial, incomplete results, an 'all reduce' over the output data is required, and this is relatively time-consuming. Interestingly, with a reasonable splitting and combining scheme, most primitives can operate independently across nodes, which helps performance considerably. A rational splitting method is therefore extremely important.

Taking the FFN module as an example: if the first matmul splits its weight by column, multiplying it with the input yields two unrelated sub-matrices, one per node. If the second matmul then splits its weight by row, it can proceed directly on those sub-matrices without an intervening 'all reduce'. The entire FFN module therefore needs only one 'all reduce'; with a properly tailored split, even a chain of several matmuls may need only a single 'all reduce'. The element-wise operations between the matmuls can be ignored here, as they do not influence the result.

[figure: FFN split]

The attention module is more complex. As shown in the following figure, a rational split allows the entire attention module to require only one 'all reduce' operation, greatly reducing synchronization time.

[figure: attention split]
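A minimal NumPy sketch of the FFN example above (a single-process simulation, not llama.cpp code): the first weight is pre-split by column and the second by row, the shards are multiplied independently, and a single 'all reduce' (here just a sum over shards) recovers the full output.

```python
# Single-process sketch of the column-then-row split described above.
# Shards are simulated as list entries and "all reduce" is a plain sum;
# nothing here is a llama.cpp/ggml API.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, tp = 8, 16, 2

x  = rng.standard_normal((1, d_model))        # input activation
W1 = rng.standard_normal((d_model, d_ff))     # FFN up projection
W2 = rng.standard_normal((d_ff, d_model))     # FFN down projection

# Reference: single-node FFN (the element-wise activation is omitted,
# since it does not affect how the split composes).
ref = (x @ W1) @ W2

# Pre-split weights: W1 by column, W2 by row.
W1_shards = np.split(W1, tp, axis=1)
W2_shards = np.split(W2, tp, axis=0)

# Each "node" computes its partial result independently...
partials = [(x @ W1_shards[r]) @ W2_shards[r] for r in range(tp)]

# ...and one all-reduce (sum) at the end recovers the full output.
out = np.sum(partials, axis=0)
assert np.allclose(out, ref)
```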

ClarkChin08 avatar Aug 19 '24 01:08 ClarkChin08

Not sure if this is related to #4014

Chocobi-1129 avatar Aug 20 '24 08:08 Chocobi-1129

> Not sure if this is related to #4014

To reduce the communication time and improve latency, we should minimize the use of 'all reduce.' My proposal includes two improvements:

  1. Splitting the Weight Tensor Before Inference:

    • We can split the weight tensor and distribute the partial weights across the TP (Tensor Parallel) nodes during the 'llm_load_tensors' phase. This involves adding three specific tensor split types when creating a tensor. The different split methods let us avoid 'all reduce' operations between two matrix multiplications (matmuls), just as shown in the figures above. Element-wise operations will likewise only compute partial tensors. [figure]

    • We can create the weight tensors with the split type as shown in the following illustration: [figure]

  2. Inference with Split Weights:

    • After splitting the weights, each node infers only its part of the model. The only change is that each attention block and each MLP block now contains one 'all reduce' operation. These 'all reduce' operations always follow the matmul whose weights are split by column.
    • We can fuse the 'all reduce' with the matmul operation by checking the weight's split type, as illustrated below: [figure]

By setting the tensor's split type during the weight-loading phase and adding support for 'all reduce' in the matmul calculation, we can reduce the number of 'all reduce' operations to just two per layer. Each node's computational workload is only 1/world_size of the original, which significantly improves latency. The relevant pull request will be submitted soon, and comments are welcome. A minimal sketch of the idea follows.
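Here is a minimal single-process sketch of the two improvements, assuming hypothetical names (`SplitType`, `ShardedWeight`, `tp_matmul`) rather than actual llama.cpp/ggml APIs. The split types are named after which dimension of the mathematical weight matrix is sharded, since the 'row'/'column' wording above depends on the storage layout.

```python
# Sketch only: (1) each weight is tagged with a split type at "load" time,
# (2) the matmul helper fuses the single all-reduce per block.
# All names are illustrative assumptions; the all-reduce is just a sum.
from enum import Enum, auto
import numpy as np

class SplitType(Enum):
    NONE    = auto()  # weight replicated on every node
    OUT_DIM = auto()  # output features sharded; result stays sharded
    IN_DIM  = auto()  # input features sharded; needs an all-reduce after

class ShardedWeight:
    def __init__(self, full: np.ndarray, split: SplitType, tp: int):
        self.split = split
        if split is SplitType.OUT_DIM:
            self.shards = np.split(full, tp, axis=1)
        elif split is SplitType.IN_DIM:
            self.shards = np.split(full, tp, axis=0)
        else:
            self.shards = [full] * tp

def all_reduce(partials):
    # Stand-in for the real cross-node all-reduce.
    return np.sum(partials, axis=0)

def tp_matmul(x_shards, w: ShardedWeight):
    """Per-node matmul that fuses the all-reduce when the split type needs it."""
    partials = [x_shards[r] @ w.shards[r] for r in range(len(x_shards))]
    if w.split is SplitType.IN_DIM:
        full = all_reduce(partials)       # the one all-reduce of this block
        return [full] * len(x_shards)     # replicated result on every node
    return partials                       # still sharded, no synchronization

# Example: an FFN block, two matmuls, exactly one all-reduce.
rng = np.random.default_rng(0)
tp, d_model, d_ff = 2, 8, 16
x = rng.standard_normal((1, d_model))
w_up   = ShardedWeight(rng.standard_normal((d_model, d_ff)), SplitType.OUT_DIM, tp)
w_down = ShardedWeight(rng.standard_normal((d_ff, d_model)), SplitType.IN_DIM, tp)

h = tp_matmul([x] * tp, w_up)    # sharded activations, no communication
y = tp_matmul(h, w_down)         # all-reduce fused into this matmul
assert np.allclose(y[0], (x @ np.hstack(w_up.shards)) @ np.vstack(w_down.shards))
```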

ClarkChin08 avatar Aug 26 '24 11:08 ClarkChin08

I would like this method to support older GPUs as well, not only newer ones. For example, I have 4 Tesla P40s and would like to see a noticeable speedup.

Vladonai avatar Aug 27 '24 16:08 Vladonai

@ClarkChin08 I'm interested in your experiments with tensor parallelism on Xeon CPUs. Can you tell me more details about the hardware you used? What was the latency between cluster nodes? Do you think it would scale to x8 or x16 nodes or there are diminishing returns due to the time spent on "all reduce" operations?

fairydreaming avatar Oct 25 '24 09:10 fairydreaming

@ClarkChin08 and does llama.cpp currently support tensor parallelism?

CarlHuangNuc avatar Oct 28 '24 06:10 CarlHuangNuc

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Dec 13 '24 01:12 github-actions[bot]

This became stale, but I think it's still relevant. Backends that have this see a big speedup on large models in multi-GPU setups.

henk717 avatar Jan 08 '25 20:01 henk717

I guess this mechanism could also be used in a NUMA architecture, treating each NUMA node as a standalone GPU with its own VRAM. This is important for speeding up systems with distributed memory.

A uniform implementation across CPU and GPU could then be considered.

Readon avatar Feb 21 '25 05:02 Readon

Thanks @Readon, I linked this to my testing on a dual-socket Intel Xeon 6980P... 6 NUMA nodes is rough lol...

https://github.com/ggml-org/llama.cpp/discussions/12088

ubergarm avatar Feb 27 '25 21:02 ubergarm

> I guess this mechanism could also be used in a NUMA architecture, treating each NUMA node as a standalone GPU with its own VRAM. This is important for speeding up systems with distributed memory.
>
> A uniform implementation across CPU and GPU could then be considered.

You are right. This was recently implemented by SGLang.

https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/#multi-numa-parallelism

rankaiyx avatar Jul 31 '25 13:07 rankaiyx

Was this implemented or moved? Another idea that keeps coming up is to do the tensor-parallel operation over the multiple attention heads. A rough sketch of that idea is below.
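A minimal NumPy sketch (assumptions only, not llama.cpp code) of what head-level tensor parallelism could look like: each node owns a subset of heads plus the matching slice of the output projection, and one all-reduce over the partial outputs completes the attention block.

```python
# Hypothetical head-parallel attention: nodes are simulated in one process,
# "all reduce" is a sum. None of these names are llama.cpp/ggml symbols.
import numpy as np

rng = np.random.default_rng(0)
tp, n_head, d_head, d_model, seq = 2, 4, 4, 16, 3
x = rng.standard_normal((seq, d_model))

# Per-head Q/K/V projections and the per-head slice of the output projection.
Wq = rng.standard_normal((n_head, d_model, d_head))
Wk = rng.standard_normal((n_head, d_model, d_head))
Wv = rng.standard_normal((n_head, d_model, d_head))
Wo = rng.standard_normal((n_head, d_head, d_model))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_output(h):
    q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
    att = softmax(q @ k.T / np.sqrt(d_head)) @ v
    return att @ Wo[h]            # already projected back to d_model

# Reference: all heads on one node.
ref = sum(head_output(h) for h in range(n_head))

# Tensor parallel: each node owns n_head/tp heads and works independently...
heads_per_node = n_head // tp
partials = [sum(head_output(h) for h in range(r * heads_per_node,
                                              (r + 1) * heads_per_node))
            for r in range(tp)]

# ...and one all-reduce (sum) over the partial outputs finishes the block.
out = np.sum(partials, axis=0)
assert np.allclose(out, ref)
```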

Kamayuq avatar Aug 05 '25 18:08 Kamayuq

This issue isn't stale. Still important.

shihanqu avatar Aug 30 '25 01:08 shihanqu

@l29ah

shihanqu avatar Aug 30 '25 01:08 shihanqu

@ggerganov @slaren I don't have the permissions to reopen this issue, but perhaps one of you can?

shihanqu avatar Sep 02 '25 16:09 shihanqu