Inferencing not working with P2P in latest version.
LocalAI version:
localai/localai:latest-gpu-nvidia-cuda-12, v2.22.1 (015835dba2854572d50e167b7cade05af41ed214)
Environment, CPU architecture, OS, and Version:
Linux localai3 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux (Proxmox LXC, Debian). AMD EPYC 7302P (16 cores allocated), 64 GB RAM.
Describe the bug
When testing distributed inferencing, I select a model (Qwen 2.5 14B) and send a chat message. The model loads on both instances (main and worker), but it never responds, and the model then unloads on the worker (observed with nvitop).
To Reproduce
The description above should reproduce it; I tried a few times.
Expected behavior
The model should not unload and the chat completion should succeed.
Logs
Worker logs:
{"level":"INFO","time":"2024-10-26T05:07:23.924Z","caller":"discovery/dht.go:115","message":" Bootstrapping DHT"}
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes
Starting RPC server on 127.0.0.1:46609, backend memory: 16380 MB
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Main logs:
5:25AM INF Success ip=my.ip.address latency="960.876µs" method=POST status=200 url=/v1/chat/completions
5:25AM INF Trying to load the model 'qwen2.5-14b-instruct' with the backend '[llama-cpp llama-ggml llama-cpp-fallback rwkv stablediffusion whisper piper huggingface bert-embeddings /build/backend/python/rerankers/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/mamba/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/coqui/run.sh /build/backend/python/bark/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/transformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/vllm/run.sh]'
5:25AM INF [llama-cpp] Attempting to load
5:25AM INF Loading model 'qwen2.5-14b-instruct' with backend llama-cpp
5:25AM INF [llama-cpp-grpc] attempting to load with GRPC variant
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Success ip=127.0.0.1 latency="35.55µs" method=GET status=200 url=/readyz
5:26AM INF Node localai-oYURMqpWCR is offline, deleting
Error accepting: accept tcp 127.0.0.1:35625: use of closed network connection
Additional context
This worked in the previous version, though I'm not sure which one that was at this point (~2 weeks ago). The model loads and works fine without the worker.
I'm using Docker Compose; here's the config: https://github.com/j4ys0n/local-ai-stack
It is related to EdgeVPN: it somehow sees peers and addresses but cannot connect to them.
I have tried NAT traversal using libp2p + pubsub and managed to get peer discovery working and to establish a p2p connection via a rendezvous point.
In your case, if you know your worker's address, you can just put it into the LocalAI environment as a gRPC external backend address.
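For example, a minimal sketch (the 192.168.1.50:50052 worker address here is just a placeholder; use the host and RPC port your worker actually listens on):

# hypothetical worker address; point this at your llama.cpp RPC worker
LLAMACPP_GRPC_SERVERS="192.168.1.50:50052" local-ai run qwen2.5-14b-instruct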
@mudler
Fixed by https://github.com/mudler/LocalAI/pull/4220; more info in the duplicate issue (sorry for that): https://github.com/mudler/LocalAI/issues/4214
It looks like it still didn't make it into the images quay.io/go-skynet/local-ai:latest-cpu or (Docker Hub) localai/localai:latest-cpu?
Still not able to do p2p inferencing even though the workers are online. v2.25.0 (07655c0c2e0e5fe2bca86339a12237b69d258636)

Server and worker envs:
CONTEXT_SIZE: "512"
THREADS: "4"
MODELS_PATH: /models
LLAMACPP_PARALLEL: "999"
TOKEN: &p2ptoken "xxx"
P2P_TOKEN: *p2ptoken
LOCALAI_P2P_TOKEN: *p2ptoken
LOCALAI_P2P_LISTEN_MADDRS: /ip4/0.0.0.0/tcp/8888
Server args: run --p2p
Worker args: worker p2p-llama-cpp-rpc '--llama-cpp-args=-H 0.0.0.0 -p 8082 -m 4096'
Container ports 8082 and 8888 are opened via ports.containerPort (Kubernetes-based setup).
Attaching logs below for the server and one of the two workers.
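For reference, roughly how the worker env and args above map onto the container spec (a trimmed, illustrative sketch; the image tag and names are examples, not the exact manifest):

containers:
  - name: localai-worker
    image: localai/localai:latest-gpu-nvidia-cuda-12
    args: ["worker", "p2p-llama-cpp-rpc", "--llama-cpp-args=-H 0.0.0.0 -p 8082 -m 4096"]
    env:
      - name: LOCALAI_P2P_TOKEN
        value: "xxx"
      - name: LOCALAI_P2P_LISTEN_MADDRS
        value: /ip4/0.0.0.0/tcp/8888
    ports:
      - containerPort: 8082
      - containerPort: 8888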
@pratikbin can you share the server logs with debug enabled (DEBUG=true)? It looks like some version incompatibility.
Also, could you try with the master images to double-check? That'd be helpful. Thank you!
I cannot say if I have the EXACT same issue. I haven't debugged the libp2p/edgevpn code yet.
But P2P doesn't work.
HOWEVER, if I run local-ai worker llama-cpp-rpc --llama-cpp-args=" -p 8080" on my worker node and then run LLAMACPP_GRPC_SERVERS="192.168.1.236:8080" DEBUG=true local-ai run falcon3-1b-instruct, it DOES work. I run a small 1B model just to verify that something runs as a baseline for this functionality.
I can manage with this, though the P2P stuff is really helpful in auto discovery.
The host LocalAI instance also detects peers fine; the gRPC side just never seems to get LLAMACPP_GRPC_SERVERS updated, per the logs.
Oh, and an obvious thing I notice is the default binding to 127.0.0.1, but IDK if the VPN somehow bypasses the loopback limitation.
There you go. Let me know if you need anything else.
Mmh, ok, that looks weird: what's the environment? It looks like they can auto-discover correctly, but it somehow exhausts resource limits. Typically that is set by looking at the system env.
Did you also try to bump the UDP buffer sizes? https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes#non-bsd
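On Linux that usually means bumping net.core.rmem_max / net.core.wmem_max, e.g. something along these lines (check the linked page for the exact values it currently recommends):

sudo sysctl -w net.core.rmem_max=7500000
sudo sysctl -w net.core.wmem_max=7500000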
Shall try this one (the LLAMACPP_GRPC_SERVERS workaround above).
Is there a fix for this? Because I think the p2p feature is also broken on the Jetson.
Prob up to mudler. I have not been able to justify the time to dig into this problem, and my priorities have shifted for now.
That's no worries. Is there a specific commit this was broken in? I'm thinking I can roll back to before that commit and rebuild the container image from scratch.
No idea. You will have to ride the commit tree yourself. Good luck 🖖
I have this issue as well, with the same errors as the original poster.
I do note that when I try to run a model I get an error such as the following in the logs:
DBG GRPC(hermes-2-pro-llama-3-8b:Q8_0-127.0.0.1:43077): stderr load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
I had thought that Q8 and F16 models were supported....
Hi team,
I tried to reproduce this issue and noticed that my logs looked almost identical to what was described in the original post:
Reproduction steps
The following commands reproduce the issue:
Server:
sudo docker run -it --rm \
-e DEBUG=true \
--net host \
--name local-ai \
localai/localai:latest-cpu \
run moondream2 --p2p
Worker:
sudo docker run -it --rm \
-e DEBUG=true \
-e TOKEN=$TOKEN \
--name localai-worker \
localai/localai:latest-cpu \
worker p2p-llama-cpp-rpc
Worker logs: these show a log pattern similar to the one reported in the original post.
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Observation
The total_mem value matches my host system’s memory as reported by free -b, suggesting the container sees the full host memory instead of a limited one.
What I tried
I suspected it might be memory-related, so I re-ran with explicit Docker memory limits:
Server:
sudo docker run -it --rm \
-m="8g" --memory-swap="8g" \
-e DEBUG=true \
--net host \
--name local-ai \
localai/localai:latest-cpu \
run moondream2 --p2p
Worker:
sudo docker run -it --rm \
-m="8g" --memory-swap="8g" \
-e DEBUG=true \
-e TOKEN=$TOKEN \
--name localai-worker \
localai/localai:latest-cpu \
worker p2p-llama-cpp-rpc
After doing this, the worker behaved as expected.
Conclusion
I’m not entirely sure, but this might be related to Docker not enforcing memory limits unless explicitly set. Adding -m and --memory-swap resolved it in my case.
Hope this helps!
Since I'm having the same issues, I tried @chengchialai0719's setup, but without success. I can see the worker node in the "Swarm" tab, but as soon as I start chatting (with e.g. qwen3-1.7b, which is small in size) I get this in the container's logs on the main node:
1:43PM ERR failed to install model "Qwen3-1.7B.Q4_K_M.gguf" from gallery error="no model found with name \"Qwen3-1.7B.Q4_K_M.gguf\""
1:43PM INF Trying to load the model 'qwen3-1.7b' with the backend '[llama-cpp llama-cpp-fallback piper silero-vad stablediffusion-ggml whisper huggingface]'
1:43PM INF [llama-cpp] Attempting to load
1:43PM INF BackendLoader starting backend=llama-cpp modelID=qwen3-1.7b o.model=Qwen3-1.7B.Q4_K_M.gguf
1:43PM INF [llama-cpp-grpc] attempting to load with GRPC variant
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Success ip=127.0.0.1 latency="40.648µs" method=GET status=200 url=/readyz
1:44PM INF Node PC-VTzNyTIclT is offline, deleting
Error accepting: accept tcp 127.0.0.1:40819: use of closed network connection
and this on the worker node:
create_backend: using CPU backend
Starting RPC server v2.0.0
endpoint : 127.0.0.1:44453
local cache : n/a
backend memory : 15208 MB
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
I also tried using the Vulkan image on both nodes, as well as the aio-cpu image on both (latest in all cases), and sadly still had no success. The model exists on both nodes, since I mount a folder containing the models into each Docker container. And this works fine when using the Vulkan image on the worker node without p2p, so that is not the issue.
I also get this error when using this setup. Main node:
version: "3.9"
services:
api:
image: localai/localai:latest-aio-cpu
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 1m
timeout: 20m
retries: 5
ports:
- 8080:8080
volumes:
- /appdata/local-ai/models:/build/models
environment:
- TOKEN=XXX
- LOCALAI_P2P=true
- LOCALAI_MODELS_PATH=/build/models
network_mode: host
command: ["run", "--p2p"]
deploy:
resources:
limits:
memory: 8g
reservations:
memory: 8g
mem_swappiness: 0
worker node:
version: "3.9"
services:
api:
image: localai/localai:latest-aio-cpu
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 1m
timeout: 20m
retries: 5
ports:
- 8080:8080
command: ["worker", "p2p-llama-cpp-rpc"]
# I left this here since I swapped with the vulcan image trying to get this working
devices:
- '/dev/kfd:/dev/kfd'
- '/dev/dri:/dev/dri'
group_add:
- video
environment:
- LOCALAI_MODELS_PATH=/build/models
- LOCALAI_P2P=true
- TOKEN=XXX
network_mode: host
volumes:
- /appdata/local-ai/models:/build/models
deploy:
resources:
limits:
memory: 8g
reservations:
memory: 8g
mem_swappiness: 0
Any idea how to fix this issue?
Did you ever manage to get this working? I have substantially the same setup, and the same outcomes. The workers check in just fine, but when I send a request to the master, it simply never sends a request to the worker.
Unfortunately, no. I instead switched to the llama.cpp backend with RPC to get the same functionality. Sure, it's more complex (it also required building the backend myself, since RPC is not included by default), but in the end it was worth it. I was able to distribute a pretty large LLM over three GPUs with 8 GB of VRAM each, and it ran pretty quickly even though they were only connected via a 1 Gbit link.
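In case it helps anyone, a rough sketch of that llama.cpp RPC route (flags follow llama.cpp's rpc example docs; double-check them against the version you build):

# build llama.cpp with the RPC backend enabled
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# on each worker node: start the RPC server (50052 is an arbitrary port)
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# on the main node: point llama.cpp at the workers and offload layers
./build/bin/llama-cli -m model.gguf --rpc worker1:50052,worker2:50052,worker3:50052 -ngl 99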
Hitting the same issue, but it seems that it may be related to changes in llama.cpp, not LocalAI itself. Perhaps this: https://github.com/ggml-org/llama.cpp/commit/898acba6816ad23b6a9491347d30e7570bffadfd?