Maybe looking at the source code of [distributed-llama](https://github.com/b4rtaz/distributed-llama) might help. I use it on multiple nodes (CPU only) and for now it seems to be the fastest solution (by far)...
How did you start the workers?
Hi, do you have enough free RAM on your systems? Dllama doesn't seem to check if the model will fit into RAM.
In order for rpc-server to split the model, you have to set up more than one rpc-server instance, AFAIK.
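A minimal sketch of that idea (the ports, model path, and prompt here are placeholders I chose for illustration, not values from this thread): two rpc-server instances, each on its own port, and llama-cli pointed at both so the model buffers get distributed across the listed endpoints:

```
# start two rpc-server instances (one per endpoint you want in the split)
./rpc-server -p 50052 &
./rpc-server -p 50053 &

# point llama-cli at both endpoints; the model is split across
# every address listed in --rpc
./llama-cli -m ./model.gguf --rpc 127.0.0.1:50052,127.0.0.1:50053 -p "Hello"
```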
I use six nodes; this is the parameter for llama-cli: `--rpc 192.168.0.150:50052,192.168.0.151:50052,192.168.0.152:50052,192.168.0.153:50052,192.168.0.154:50052,192.168.0.155:50052`. It will split a 665 GB model like this:
```
load_tensors: RPC[192.168.0.150:50052] model buffer size = 95096.06 MiB
load_tensors:...
```
On each of my six nodes (CPU-only) I run this: `./rpc-server -p 50052 -H ` Then, on the first node, I run the llama-cli command. That means the first node...
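For reference, a sketch of that per-node setup; the bind address `0.0.0.0`, the model path, and the prompt are my assumptions, not taken from the posts above:

```
# on each of the six worker nodes: expose rpc-server on the LAN
# (0.0.0.0 is an assumed bind address; adjust to your network)
./rpc-server -p 50052 -H 0.0.0.0

# on the first node only: run llama-cli and list every worker in --rpc
./llama-cli -m ./model.gguf \
  --rpc 192.168.0.150:50052,192.168.0.151:50052,192.168.0.152:50052,192.168.0.153:50052,192.168.0.154:50052,192.168.0.155:50052 \
  -p "Hello"
```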
In my case it works like this: when I run the `llama-cli` command, it first loads the model tensors and then sequentially sends parts of the model to...
> However, 192.168.13.12 does not participate in the inference, because its CPU utilization and memory usage are not growing.

This is the llama-cli command from the official documentation: `$ bin/llama-cli -m...`
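As a hedged sketch of that kind of invocation (the model path, prompt, second address, and port are placeholders of mine, not from the docs or the quote): a node only takes part in the inference if its address appears in `--rpc`:

```
# a node participates only if it is listed here; if 192.168.13.12:50052
# is missing from --rpc, its CPU and memory stay idle
bin/llama-cli -m ./model.gguf -p "Hello, my name is" -n 64 \
  --rpc 192.168.13.11:50052,192.168.13.12:50052
```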