rpc : copy tensors across servers
This is an attempt to make copying tensors across servers more efficient. It introduces two new RPC commands:

- `HELLO` - sent after establishing a connection to identify the remote party (client or server)
- `REMOTE_COPY_TENSOR` - sent to the host which holds the source tensor, along with the destination tensor and the destination endpoint
```mermaid
sequenceDiagram
    Note over Scheduler: Copy X on Server A to Y on Server B
    Scheduler->>Server A: REMOTE_COPY_TENSOR
    Server A->>Server B: HELLO
    Server A->>Server B: SET_TENSOR
    Server B-->>Server A:
    Server A-->>Scheduler:
```
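For illustration, the request payload and the server-side handling of `REMOTE_COPY_TENSOR` could look roughly like the sketch below. The struct layout and helper names are hypothetical, chosen for readability, and do not match the actual definitions in `ggml-rpc.cpp`:

```cpp
// Sketch only: an illustrative layout for the REMOTE_COPY_TENSOR request and the
// server-side flow shown in the diagram above. All names here are hypothetical
// and do not match the actual ggml-rpc.cpp definitions.
#include <cstdint>
#include <vector>

// Serialized tensor descriptor, assuming tensor metadata is shipped by value.
struct rpc_tensor_desc {
    uint64_t id;        // remote tensor handle
    uint32_t type;      // ggml type
    uint64_t ne[4];     // elements per dimension
    uint64_t nb[4];     // strides in bytes
    uint64_t offset;    // offset within the remote buffer
};

// Request sent by the scheduler to the server that owns the source tensor (Server A).
struct rpc_msg_remote_copy_tensor {
    rpc_tensor_desc src;        // tensor living on Server A
    rpc_tensor_desc dst;        // tensor living on Server B
    char dst_endpoint[128];     // "host:port" of Server B
};

// Hypothetical helpers, declared only so the sketch is self-contained.
std::vector<uint8_t> read_local_tensor(const rpc_tensor_desc & t);
int  connect_to_endpoint(const char * endpoint);
void send_hello(int sock, bool is_server);
bool send_set_tensor(int sock, const rpc_tensor_desc & dst, const std::vector<uint8_t> & data);
void close_socket(int sock);

// Server A: connect to Server B, identify itself with HELLO, then push the data
// with a regular SET_TENSOR so the payload never passes through the scheduler.
bool handle_remote_copy_tensor(const rpc_msg_remote_copy_tensor & msg) {
    std::vector<uint8_t> data = read_local_tensor(msg.src);
    int sock = connect_to_endpoint(msg.dst_endpoint);
    if (sock < 0) {
        return false;
    }
    send_hello(sock, /*is_server=*/true);
    bool ok = send_set_tensor(sock, msg.dst, data);
    close_socket(sock);
    return ok;
}
```

The key property is that the scheduler only sends the small request; the tensor data travels directly from Server A to Server B over the connection established with `HELLO`.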
- [x] I have read the contributing guidelines
- Self-reported review complexity:
  - [ ] Low
  - [x] Medium
  - [ ] High
@rgerganov any updates on the RPC backend? I did some profiling, but I have no ideas for a quick optimization (one with a small diff). I would like to see some refactoring of the llama.cpp RPC code.
@lexasub I don't have any ideas on how to speed things up with a single RPC server. With multiple RPC servers, you can try to resurrect this patch and see if it makes things better for your use cases. My benchmarks back in the day didn't show any significant improvements, but I may have missed something.
@rgerganov I attempted to rebase this branch to resolve conflicts with the latest upstream changes, but the scope of the conflicts (especially in `ggml-rpc.cpp` and the buffer context handling) suggests that manual adjustments might be unavoidable. I've started reworking some sections locally, but I'm concerned about diverging from your intended approach.
Question: Have you been working on a more up-to-date version of this branch? If so, could you share it or highlight key changes that need preservation? This would help ensure alignment and avoid redundant work.
> Question: Have you been working on a more up-to-date version of this branch?
No, I am not working on this and I don't have updates. If you are going to work on this, my recommendation is to prepare a real setup with at least 3 hosts connected over a physical network and perform some benchmarks to establish a baseline.
Testing on the same physical host with servers running on localhost may not give relevant results.
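As one possible starting point for such a baseline, a minimal upload-latency measurement against a single RPC server could look roughly like this. It assumes the public ggml API (`ggml_backend_rpc_init`, `ggml_backend_alloc_ctx_tensors`, `ggml_backend_tensor_set`); the endpoint and tensor size are placeholders:

```cpp
// Minimal sketch of a tensor-upload baseline against one RPC server.
// Assumes the public ggml / ggml-rpc headers; the endpoint is a placeholder.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-rpc.h"

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const char * endpoint = "192.168.1.10:50052";        // placeholder host:port
    ggml_backend_t backend = ggml_backend_rpc_init(endpoint);
    if (!backend) {
        fprintf(stderr, "failed to connect to %s\n", endpoint);
        return 1;
    }

    // Create a single large tensor on the remote server.
    const int64_t n = 64 * 1024 * 1024;                  // 64M floats = 256 MB
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead(),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_tensor  * t   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // Time a full upload of the tensor data over the RPC connection.
    std::vector<float> data(n, 1.0f);
    auto t0 = std::chrono::steady_clock::now();
    ggml_backend_tensor_set(t, data.data(), 0, n * sizeof(float));
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("uploaded %.1f MB in %.3f s (%.1f MB/s)\n",
           n * sizeof(float) / 1e6, sec, n * sizeof(float) / 1e6 / sec);

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```

Repeating the same measurement for a copy between two servers, before and after this patch, would show whether the direct server-to-server path actually helps on real hardware.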
@rgerganov I have run into challenges implementing an output-queue "pipeline" for the ggml client-server architecture, because the code that consumes the output parameter of `send_rpc_cmd` sits right next to the call that produces it.
The idea is that the output would be filled in later by a worker thread, but integrating that at the right points in the codebase has proven complex, particularly given my limited familiarity with ggml's architecture (it is unclear how and when the caller should fetch the data produced by the thread, and the places that consume it are complex).
While the current focus is on getting it functional, I'm concerned about potential inefficiencies, such as blocking while waiting for the output to be populated, which could hinder parallel processing on the server side.
The ongoing work can be tracked at https://github.com/lexasub/llama.cpp/tree/async-rpc-squashed/ggml/src/ggml-rpc (draft)
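One common way to decouple sending a command from consuming its reply is a promise/future pair per in-flight request, roughly as sketched below. This is a generic pattern shown for illustration, not the code in the linked branch, and the real `send_rpc_cmd` in `ggml-rpc.cpp` has a different, synchronous signature:

```cpp
// Generic sketch of an async request pipeline: the sender enqueues a command and
// immediately gets a future; a reader thread matches replies to pending requests.
// Not the code in the linked branch; names and framing are illustrative.
#include <condition_variable>
#include <cstdint>
#include <future>
#include <mutex>
#include <queue>
#include <unordered_map>
#include <vector>

struct rpc_reply {
    std::vector<uint8_t> payload;
};

class async_rpc_client {
public:
    // Enqueue a command and return a future that the reader thread will fulfill.
    std::future<rpc_reply> send_cmd_async(uint8_t cmd, std::vector<uint8_t> input) {
        std::lock_guard<std::mutex> lock(mutex_);
        uint64_t id = next_id_++;
        std::promise<rpc_reply> promise;
        std::future<rpc_reply> fut = promise.get_future();
        pending_[id] = std::move(promise);
        outbox_.push({id, cmd, std::move(input)});   // a writer thread drains this queue
        cv_.notify_one();
        return fut;
    }

    // Called by the reader thread when a reply tagged with `id` arrives on the socket.
    void on_reply(uint64_t id, rpc_reply reply) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = pending_.find(id);
        if (it != pending_.end()) {
            it->second.set_value(std::move(reply));
            pending_.erase(it);
        }
    }

private:
    struct outgoing {
        uint64_t cmd_id;
        uint8_t  cmd;
        std::vector<uint8_t> input;
    };

    std::mutex mutex_;
    std::condition_variable cv_;
    uint64_t next_id_ = 0;
    std::queue<outgoing> outbox_;
    std::unordered_map<uint64_t, std::promise<rpc_reply>> pending_;
};
```

With this shape, only commands whose results are needed immediately (e.g. a `GET_TENSOR`) block on their future; everything else can stay in flight, which addresses the concern about waiting for the output to be populated.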
@rgerganov, I previously considered using gRPC here, but I can't yet say whether it would have the desired effect. Does the llama.cpp RPC protocol transmit a lot of metadata (such as field names and delimiters), or is everything packed as efficiently as possible (not in terms of pragma pack, but in terms of field names, as mentioned)? If a significant amount of metadata (like names) is currently being transmitted, I'm willing to do some research on gRPC. We could also try compressing tensors before sending them.
My initial implementation of the RPC backend was using gRPC and switching to a custom binary serialization improved the performance a lot: https://github.com/ggerganov/llama.cpp/pull/6829#issuecomment-2082477213
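To illustrate the difference: a custom binary protocol can frame every request as a tiny fixed header followed by raw payload bytes, so no field names, tags, or delimiters go over the wire. The sketch below is illustrative; the exact framing in `ggml-rpc.cpp` may differ:

```cpp
// Illustrative framing for a compact binary RPC protocol: a one-byte command id,
// an 8-byte payload size, then the raw payload. No field names, tags, or
// delimiters are transmitted. The exact layout used by ggml-rpc.cpp may differ.
#include <cstdint>
#include <cstring>
#include <vector>

// Pack a command into one contiguous buffer ready for send():
//   [ cmd : 1 byte ][ size : 8 bytes, host endianness ][ payload : size bytes ]
std::vector<uint8_t> pack_frame(uint8_t cmd, const void * payload, uint64_t size) {
    std::vector<uint8_t> buf(1 + sizeof(uint64_t) + size);
    buf[0] = cmd;
    memcpy(buf.data() + 1, &size, sizeof(size));
    memcpy(buf.data() + 1 + sizeof(size), payload, size);
    return buf;
}
```

The per-request overhead stays constant regardless of tensor size, and there is no per-field encode/decode work on either end, which is consistent with the speedup reported above when moving away from gRPC.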
> > Question: Have you been working on a more up-to-date version of this branch?
>
> No, I am not working on this and I don't have updates. If you are going to work on this, my recommendation is to prepare a real setup with at least 3 hosts connected over a physical network and perform some benchmarks to establish a baseline.
> Testing on the same physical host with servers running on localhost may not give relevant results.
I have 3 hosts on a physical network and would be willing to test this if anyone picks it up. How is the tensor copied today? I'm guessing it goes from each server back to the scheduler and then from the scheduler to the next server? If so, implementing this should roughly cut the copy latency in half?
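For reference, a sketch of the two-hop path that such a copy takes today, staged through client memory, which is what the generic cross-backend copy fallback in ggml does when the two backends cannot copy directly. The tensor handles are placeholders and the real code path goes through `ggml_backend_tensor_copy`:

```cpp
// Sketch of today's two-hop copy between two RPC servers: the data is pulled
// from Server A into client memory (GET_TENSOR under the hood) and then pushed
// to Server B (SET_TENSOR under the hood). Tensor handles are placeholders.
#include "ggml.h"
#include "ggml-backend.h"

#include <cstdint>
#include <vector>

void copy_via_client(struct ggml_tensor * src_on_a, struct ggml_tensor * dst_on_b) {
    const size_t nbytes = ggml_nbytes(src_on_a);
    std::vector<uint8_t> staging(nbytes);

    // hop 1: Server A -> client
    ggml_backend_tensor_get(src_on_a, staging.data(), 0, nbytes);

    // hop 2: client -> Server B
    ggml_backend_tensor_set(dst_on_b, staging.data(), 0, nbytes);
}
```

With `REMOTE_COPY_TENSOR`, the client only sends the command to Server A and the payload goes directly from A to B, so the bulk data crosses one network link instead of two; whether that translates into roughly half the latency depends on the topology, which is exactly what a 3-host benchmark would show.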