
Nodes do not find each other

Open schwaa opened this issue 1 year ago • 12 comments

I have two Ubuntu 22.04 nodes; both start but do not find each other. Both nodes are on the same network. I have opened firewall port 52415 for TCP and UDP. One machine has an NVIDIA GPU and the other is an RK1 with 16 GB. Both run Python 3.12. I can SSH between the two machines with no problem, and their hostnames show up in my network software. Both are on a wired connection, not Wi-Fi. Not sure where I am going wrong here. There are no errors.
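Before digging into exo itself, it can help to confirm that raw UDP traffic actually flows between the two machines on port 52415 (the port from the firewall rule above). This is a minimal sketch independent of exo; the peer IP is a placeholder you'd fill in:

```python
import socket
import sys

PORT = 52415  # exo's default port, per the firewall rule above


def listen(port: int = PORT, timeout: float = 10.0) -> str:
    """Run on machine A: wait for one UDP datagram and report its payload."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("0.0.0.0", port))
        sock.settimeout(timeout)
        data, addr = sock.recvfrom(1024)
        return f"got {data!r} from {addr[0]}"


def send(peer_ip: str, port: int = PORT) -> None:
    """Run on machine B: fire one datagram at machine A."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"exo-discovery-test", (peer_ip, port))


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python udp_test.py listen          (on one node)
    #        python udp_test.py send <peer-ip>  (on the other)
    if sys.argv[1] == "listen":
        print(listen())
    else:
        send(sys.argv[2])
```

If the listener times out, the problem is at the network/firewall layer rather than in exo; if the datagram arrives, the issue is more likely in exo's discovery itself.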

[screenshot]

schwaa avatar Nov 30 '24 18:11 schwaa

I can chat with the RTX node, but the clang node has issues:

Error: Failed to fetch completions: Error processing prompt (see logs with DEBUG>=2): [Errno 2] No such file or directory: 'clang'
    at Proxy.openaiChatCompletion (http://192.168.4.196:52415/index.js:375:17)
    at async Proxy.processMessage (http://192.168.4.196:52415/index.js:255:17)

I'll try to search on this.

schwaa avatar Nov 30 '24 18:11 schwaa

Added a new node with an RTX 4090 running Ubuntu, and it also isn't found on the network. Each node reports a cluster of 1 node.

schwaa avatar Dec 01 '24 00:12 schwaa

Same issue. I tried two Windows machines with NVIDIA GPUs; they failed to find each other. I also tried a Mac mini and an Ubuntu machine, but they likewise each see only one node.

cjy4979 avatar Dec 01 '24 16:12 cjy4979

Same here: a laptop with Ubuntu and an RTX 4080 plus an M1 Max MacBook Pro.

TechnicalParadox avatar Dec 03 '24 23:12 TechnicalParadox

Facing the same issue (No such file or directory: 'clang') on two laptops:

  1. Linux Mint 22 Wilma - Intel Core i7-8550U with 32 GB RAM + 2 GB NVIDIA GPU
  2. Linux Mint 22 - Intel Core M-based MacBook Air with 8 GB RAM (running Linux, no macOS)

The 1st laptop sees both machines; however, the 2nd one keeps reporting one node in the cluster!

Installing the model via tinychat did trigger activity on both machines, but when hitting the chat-completion API with the same model (Llama 3.2 1B), I keep getting the following error.

{"detail": "Error processing prompt (see logs with DEBUG>=2): <AioRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = \"Unexpected <class 'FileNotFoundError'>: [Errno 2] No such file or directory: 'clang'\"\n\tdebug_error_string = \"UNKNOWN:Error received from peer {created_time:\"2025-01-04T12:20:09.099867676+05:30\", grpc_status:2, grpc_message:\"Unexpected <class \\'FileNotFoundError\\'>: [Errno 2] No such file or directory: \\'clang\\'\"}\"\n>"}

I can run the model individually (without exo, using llama.cpp) on the 1st laptop at 20 tokens/sec - just in case it helps.

Update: I figured that installing clang on the other laptop (shown as "Clang" in exo's network graph) was worth trying, since the error reports it missing.
So, I did this:

  1. sudo apt install clang
  2. pip install numba <-- pulls in the required llvmlite package

After the above two commands, the error is gone, and I see activity on both nodes. The second (slower) node attempted to infer and generate a response; however, the request timed out because token generation was horribly slow - and that is just with the 1B Llama 3.2 model. Not sure what is wrong with the speed. This is all via the /completions endpoint, using curl from a terminal.

Running inference without exo still generates responses at 20 tokens/sec for the 1B model.

It looks like most of the work is being done by the slower node, leading to a horrible tokens/sec rate.
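The clang/llvmlite fix described above can be sanity-checked before launching exo. This is a minimal sketch that just reports whether the two pieces the error complains about are present (it does not install anything):

```python
import importlib.util
import shutil


def check_clang_backend() -> dict:
    """Report whether clang and llvmlite (the pieces the error above
    complains about) are available in the current environment."""
    return {
        "clang": shutil.which("clang"),  # None if clang is not on PATH
        "llvmlite": importlib.util.find_spec("llvmlite") is not None,
    }


if __name__ == "__main__":
    status = check_clang_backend()
    print(status)
    if status["clang"] is None:
        print("clang missing -> try: sudo apt install clang")
    if not status["llvmlite"]:
        print("llvmlite missing -> try: pip install numba")
```

Running this on each node before starting exo would have surfaced the missing dependency up front instead of mid-inference.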

stackcv avatar Jan 04 '25 06:01 stackcv

I have encountered issues between my MacBook M1 and Windows 10 WSL Ubuntu 22; the two machines are on the same local network. I forwarded ports 5678 and 52415 on Windows 10 using `netsh interface portproxy add v4tov4` and also opened firewall inbound rules for 5678 (UDP) and 52415 (TCP). The result is still that the two machines cannot discover each other. What is the reason? Thanks!

LGDHuaOPER avatar Mar 16 '25 15:03 LGDHuaOPER

I am facing the same issue. So far I have tried the following: `exo --node-id 1` and `exo --node-id 0` (different node IDs), and `exo --node-port 11434` (specifying a port I know for sure is allowed on the local network).

Here are the details of the systems I am using (all identical): OS: Red Hat Enterprise Linux 8.10 (Ootpa); GPU: NVIDIA GA102GL [RTX A5000]; CPU: Intel Core i9-10940X.
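When automatic UDP discovery fails on a locked-down network like this, exo also ships a manual discovery mode that skips broadcast entirely (`--discovery-module manual` with a topology file). The schema below is an assumption based on the sample config in the exo repo at the time; verify the field names against your installed version. IPs, hostnames, and capability numbers are placeholders:

```python
import json

# Assumed schema for exo's manual-discovery topology file -- check the
# sample config shipped with your exo version before relying on it.
peer_capabilities = {
    "model": "RHEL 8 workstation",      # placeholder description
    "chip": "NVIDIA RTX A5000",
    "memory": 24576,                    # MB, placeholder
    "flops": {"fp32": 0.0, "fp16": 0.0, "int8": 0.0},
}

topology = {
    "peers": {
        "node-a": {
            "address": "192.168.1.10",  # placeholder LAN IPs
            "port": 52415,
            "device_capabilities": dict(peer_capabilities),
        },
        "node-b": {
            "address": "192.168.1.11",
            "port": 52415,
            "device_capabilities": dict(peer_capabilities),
        },
    }
}

with open("topology.json", "w") as f:
    json.dump(topology, f, indent=2)

# Then on each machine, with its own node ID:
#   exo --node-id node-a --discovery-module manual --discovery-config-path topology.json
```

Because the peers are listed explicitly, this sidesteps any firewall or broadcast-filtering issue on the discovery path; only the listed TCP/UDP ports between the named addresses need to be open.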

anand-kamble avatar Mar 17 '25 01:03 anand-kamble

Having the same issue with a MacBook and a Windows WSL setup. They can't see each other.

vacekj avatar Mar 17 '25 23:03 vacekj

Any fixes for this? I'm trying to connect two Ubuntu machines on the same local network (one via Wi-Fi, the other via Ethernet). Despite being on the same subnet, they don't discover each other; each detects only one node.

Isaac-opz avatar Mar 21 '25 21:03 Isaac-opz

Chiming in here, as I'm having the same problem. I've also tried setting the node IDs individually. One laptop (CPU-only, with an AMD APU) and one 'server' with a (lackluster) NVIDIA card, both on the same subnet with UFW allowing port 52415 TCP and UDP.

Each seems to run fine individually, but cannot find the other node. Ubuntu 22.04 on both machines.

brewtide avatar Mar 26 '25 14:03 brewtide

Same here with two Ubuntu servers and one MacBook. It looks like the most important promise of this project, distributed inference, is not living up to the hype. I'll be happy to check back after a year.

Zhan-Li avatar Apr 06 '25 19:04 Zhan-Li

Me too.

kekekekekeshi avatar Jul 04 '25 05:07 kekekekekeshi

Problem: I connected two Mac M1 machines as exo nodes. When I start exo, both nodes appear in the cluster, but after some time one node disappears and does not rejoin. This happens intermittently, and I couldn't find a pattern yet.

any solution?

Devikpr avatar Sep 01 '25 09:09 Devikpr

Should be fixed in 1.0!

Evanev7 avatar Dec 18 '25 16:12 Evanev7