Inferencing not working with P2P in latest version.
LocalAI version:
localai/localai:latest-gpu-nvidia-cuda-12, v2.22.1 (015835dba2854572d50e167b7cade05af41ed214)
Environment, CPU architecture, OS, and Version:
Linux localai3 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64 GNU/Linux (Proxmox LXC, Debian). AMD EPYC 7302P (16 cores allocated), 64 GB RAM.
Describe the bug
When testing distributed inferencing, I select a model (Qwen 2.5 14B) and send a chat message. The model loads on both instances (main and worker), but it never responds, and the model then unloads on the worker (observed with nvitop).
To Reproduce
The description above should reproduce it; I tried a few times.
Expected behavior
The model should not unload and the chat completion should succeed.
Logs
Worker logs:
{"level":"INFO","time":"2024-10-26T05:07:23.924Z","caller":"discovery/dht.go:115","message":" Bootstrapping DHT"}
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 2000 Ada Generation, compute capability 8.9, VMM: yes
Starting RPC server on 127.0.0.1:46609, backend memory: 16380 MB
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Accepted client connection, free_mem=17175674880, total_mem=17175674880
Client connection closed
Main logs:
5:25AM INF Success ip=my.ip.address latency="960.876µs" method=POST status=200 url=/v1/chat/completions
5:25AM INF Trying to load the model 'qwen2.5-14b-instruct' with the backend '[llama-cpp llama-ggml llama-cpp-fallback rwkv stablediffusion whisper piper huggingface bert-embeddings /build/backend/python/rerankers/run.sh /build/backend/python/diffusers/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/mamba/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/coqui/run.sh /build/backend/python/bark/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/transformers/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/vllm/run.sh]'
5:25AM INF [llama-cpp] Attempting to load
5:25AM INF Loading model 'qwen2.5-14b-instruct' with backend llama-cpp
5:25AM INF [llama-cpp-grpc] attempting to load with GRPC variant
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Redirecting 127.0.0.1:35625 to /ip4/worker-ip/udp/44701/quic-v1
5:25AM INF Success ip=127.0.0.1 latency="35.55µs" method=GET status=200 url=/readyz
5:26AM INF Node localai-oYURMqpWCR is offline, deleting
Error accepting: accept tcp 127.0.0.1:35625: use of closed network connection
Additional context
This worked in the previous version, though I'm not sure which one that was at this point (~2 weeks ago). The model loads and works fine without the worker.
I'm using Docker Compose; here's the config: https://github.com/j4ys0n/local-ai-stack
It is related to EdgeVPN: it somehow sees peers and addresses but cannot connect to them.
I have tried NAT traversal using libp2p + pubsub and managed to get peer discovery working and to establish a p2p connection via a rendezvous point.
In your case, if you know your worker's address, you can just put it into the LocalAI environment as a gRPC external backend address.
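For example, a minimal sketch (the 192.168.1.50:50052 worker address here is just a placeholder; use the host and RPC port your worker actually listens on):

# hypothetical worker address; point this at your llama.cpp RPC worker
LLAMACPP_GRPC_SERVERS="192.168.1.50:50052" local-ai run qwen2.5-14b-instruct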
@mudler
Fixed by https://github.com/mudler/LocalAI/pull/4220; more info in the duplicate issue (sorry for that): https://github.com/mudler/LocalAI/issues/4214
It looks like it still didn't make it into the images quay.io/go-skynet/local-ai:latest-cpu or (Docker Hub) localai/localai:latest-cpu?
Still not able to do p2p inferencing even though the workers are online. v2.25.0 (07655c0c2e0e5fe2bca86339a12237b69d258636)

Server and worker envs:
CONTEXT_SIZE: "512"
THREADS: "4"
MODELS_PATH: /models
LLAMACPP_PARALLEL: "999"
TOKEN: &p2ptoken "xxx"
P2P_TOKEN: *p2ptoken
LOCALAI_P2P_TOKEN: *p2ptoken
LOCALAI_P2P_LISTEN_MADDRS: /ip4/0.0.0.0/tcp/8888
Server args: run --p2p
Worker args: worker p2p-llama-cpp-rpc '--llama-cpp-args=-H 0.0.0.0 -p 8082 -m 4096'
Container ports 8082 and 8888 are opened via ports.containerPort (Kubernetes-based setup).
Attaching logs below for the server and one of the two workers.
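For reference, roughly how the worker env and args above map onto the container spec (a trimmed, illustrative sketch; the image tag and names are examples, not the exact manifest):

containers:
  - name: localai-worker
    image: localai/localai:latest-gpu-nvidia-cuda-12
    args: ["worker", "p2p-llama-cpp-rpc", "--llama-cpp-args=-H 0.0.0.0 -p 8082 -m 4096"]
    env:
      - name: LOCALAI_P2P_TOKEN
        value: "xxx"
      - name: LOCALAI_P2P_LISTEN_MADDRS
        value: /ip4/0.0.0.0/tcp/8888
    ports:
      - containerPort: 8082
      - containerPort: 8888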
@pratikbin can you share the server logs with debug enabled (DEBUG=true)? It looks like some version incompatibility.
Also, could you try with the master images to double-check? That'd be helpful. Thank you!
I cannot say if I have the EXACT same issue. I haven't debugged the libp2p/edgevpn code yet.
But P2P doesn't work.
HOWEVER, if I run local-ai worker llama-cpp-rpc --llama-cpp-args=" -p 8080" on my worker node and then run LLAMACPP_GRPC_SERVERS="192.168.1.236:8080" DEBUG=true local-ai run falcon3-1b-instruct, it DOES work. I run a small 1B model just to verify that something runs as a baseline for this functionality.
I can manage with this, though the P2P stuff is really helpful in auto discovery.
The host LocalAI instance also detects peers fine; the gRPC side just never seems to get LLAMACPP_GRPC_SERVERS updated, per the logs.
Oh, and an obvious thing I notice is the default binding to 127.0.0.1, but IDK if the VPN somehow bypasses the loopback limitation.
There you go. Let me know if you need anything else.
Mmh, ok, that looks weird: what's the environment? It looks like they can auto-discover correctly, but it somehow exhausts resource limits. Typically that is set by looking at the system env.
Did you also try to bump the UDP buffer sizes? https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes#non-bsd
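On Linux that usually means bumping net.core.rmem_max / net.core.wmem_max, e.g. something along these lines (check the linked page for the exact values it currently recommends):

sudo sysctl -w net.core.rmem_max=7500000
sudo sysctl -w net.core.wmem_max=7500000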
Shall try this one (the LLAMACPP_GRPC_SERVERS workaround above).
Is there a fix for this? Because I think the p2p feature is also broken on the Jetson.
Prob up to mudler. I have not been able to justify the time to dig into this problem, and my priorities have shifted for now.
That's no worries. Is there a specific commit this was broken in? I'm thinking I can roll back to before that commit and rebuild the container image from scratch.
No idea. You will have to ride the commit tree yourself. Good luck 🖖
I have this issue as well, with the same errors as the original poster.
I do note that when I try to run a model I get an error such as the following in the logs:
DBG GRPC(hermes-2-pro-llama-3-8b:Q8_0-127.0.0.1:43077): stderr load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
I had thought that Q8 and F16 models were supported....
Hi team,
I tried to reproduce this issue and noticed that my logs looked almost identical to what was described in the original post:
Reproduction steps
The following commands reproduce the issue:
Server:
sudo docker run -it --rm \
-e DEBUG=true \
--net host \
--name local-ai \
localai/localai:latest-cpu \
run moondream2 --p2p
Worker:
sudo docker run -it --rm \
-e DEBUG=true \
-e TOKEN=$TOKEN \
--name localai-worker \
localai/localai:latest-cpu \
worker p2p-llama-cpp-rpc
Worker logs: these show a log pattern similar to the one reported in the original post.
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Accepted client connection, free_mem=135034642432, total_mem=135034642432
Client connection closed
Observation
The total_mem value matches my host system’s memory as reported by free -b, suggesting the container sees the full host memory instead of a limited one.
What I tried
I suspected it might be memory-related, so I re-ran with explicit Docker memory limits:
Server:
sudo docker run -it --rm \
-m="8g" --memory-swap="8g" \
-e DEBUG=true \
--net host \
--name local-ai \
localai/localai:latest-cpu \
run moondream2 --p2p
Worker:
sudo docker run -it --rm \
-m="8g" --memory-swap="8g" \
-e DEBUG=true \
-e TOKEN=$TOKEN \
--name localai-worker \
localai/localai:latest-cpu \
worker p2p-llama-cpp-rpc
After doing this, the worker behaved as expected.
Conclusion
I’m not entirely sure, but this might be related to Docker not enforcing memory limits unless explicitly set. Adding -m and --memory-swap resolved it in my case.
Hope this helps!
Since I'm having the same issues, I tried @chengchialai0719's setup, but without success. I can see the worker node in the "Swarm" tab, but as soon as I start chatting (with e.g. qwen3-1.7b, which is small in size) I get this in the container's logs on the main node:
1:43PM ERR failed to install model "Qwen3-1.7B.Q4_K_M.gguf" from gallery error="no model found with name \"Qwen3-1.7B.Q4_K_M.gguf\""
1:43PM INF Trying to load the model 'qwen3-1.7b' with the backend '[llama-cpp llama-cpp-fallback piper silero-vad stablediffusion-ggml whisper huggingface]'
1:43PM INF [llama-cpp] Attempting to load
1:43PM INF BackendLoader starting backend=llama-cpp modelID=qwen3-1.7b o.model=Qwen3-1.7B.Q4_K_M.gguf
1:43PM INF [llama-cpp-grpc] attempting to load with GRPC variant
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Redirecting 127.0.0.1:40819 to /ip4/<ip-of-worker-node>/udp/53460/quic-v1
1:43PM INF Success ip=127.0.0.1 latency="40.648µs" method=GET status=200 url=/readyz
1:44PM INF Node PC-VTzNyTIclT is offline, deleting
Error accepting: accept tcp 127.0.0.1:40819: use of closed network connection
and this on the worker node:
create_backend: using CPU backend
Starting RPC server v2.0.0
endpoint : 127.0.0.1:44453
local cache : n/a
backend memory : 15208 MB
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
Accepted client connection, free_mem=15947444224, total_mem=15947444224
Client connection closed
I also tried using the Vulkan image on both nodes, as well as the aio-cpu image on both (latest in all cases), and sadly still had no success. The model exists on both nodes, since I mount a folder containing the models into each Docker container. And this works fine when using the Vulkan image on the worker node without p2p, so that is not the issue.
I also get this error when using this setup. Main node:
version: "3.9"
services:
api:
image: localai/localai:latest-aio-cpu
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 1m
timeout: 20m
retries: 5
ports:
- 8080:8080
volumes:
- /appdata/local-ai/models:/build/models
environment:
- TOKEN=XXX
- LOCALAI_P2P=true
- LOCALAI_MODELS_PATH=/build/models
network_mode: host
command: ["run", "--p2p"]
deploy:
resources:
limits:
memory: 8g
reservations:
memory: 8g
mem_swappiness: 0
worker node:
version: "3.9"
services:
api:
image: localai/localai:latest-aio-cpu
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 1m
timeout: 20m
retries: 5
ports:
- 8080:8080
command: ["worker", "p2p-llama-cpp-rpc"]
# I left this here since I swapped with the vulcan image trying to get this working
devices:
- '/dev/kfd:/dev/kfd'
- '/dev/dri:/dev/dri'
group_add:
- video
environment:
- LOCALAI_MODELS_PATH=/build/models
- LOCALAI_P2P=true
- TOKEN=XXX
network_mode: host
volumes:
- /appdata/local-ai/models:/build/models
deploy:
resources:
limits:
memory: 8g
reservations:
memory: 8g
mem_swappiness: 0
Any idea how to fix this issue?
Did you ever manage to get this working? I have substantially the same setup, and the same outcomes. The workers check in just fine, but when I send a request to the master, it simply never sends a request to the worker.
Unfortunately, no. I instead switched to the llama.cpp backend with RPC to get the same functionality. Sure, it's more complex (it also required building the backend myself, since RPC is not included by default), but in the end it was worth it. I was able to distribute a pretty large LLM over three GPUs with 8 GB of VRAM each, and it ran pretty quickly even though they were only connected via a 1 Gbit link.
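In case it helps anyone, a rough sketch of that llama.cpp RPC route (flags follow llama.cpp's rpc example docs; double-check them against the version you build):

# build llama.cpp with the RPC backend enabled
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# on each worker node: start the RPC server (50052 is an arbitrary port)
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# on the main node: point llama.cpp at the workers and offload layers
./build/bin/llama-cli -m model.gguf --rpc worker1:50052,worker2:50052,worker3:50052 -ngl 99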
Hitting the same issue, but it seems that it may be related to changes in llama.cpp, not LocalAI itself. Perhaps this: https://github.com/ggml-org/llama.cpp/commit/898acba6816ad23b6a9491347d30e7570bffadfd?