MPIClusterManagers.jl

MPI remote machine connection

Open · yeonsookimdev opened this issue on Aug 03 '17 · 2 comments

I tried to use MPI.jl to connect different computing nodes, and I found that there is no option for specifying the hosts. With plain MPI we can specify remote hosts with mpiexec -hosts localhost,node1,node2 ... The default value of the mpirun_cmd parameter in MPIManager is mpiexec -np $np, so I tried mpiexec -np 2 -hosts node2,node3 and got the error below.

julia> using MPI

julia> manager = MPIManager(np = 2, mpirun_cmd =
       `mpiexec -np 2 -hosts node2,node3`)
MPI.MPIManager(np=2,launched=false,mode=MPI_ON_WORKERS)

julia> addprocs(manager)
ERROR: connect: host is unreachable (EHOSTUNREACH)
Stacktrace:
 [1] try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
 [2] wait() at ./event.jl:234
 [3] wait(::Condition) at ./event.jl:27
 [4] stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
 [5] wait_connected(::TCPSocket) at ./stream.jl:258
 [6] connect at ./stream.jl:983 [inlined]
 [7] connect(::IPv4, ::Int64) at ./socket.jl:738
 [8] setup_worker(::Int64, ::Int64, ::Symbol) at /home/alkorang/.julia/v0.6/MPI/src/cman.jl:197
[proxy:0:0@node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@node2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node2] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Error in MPI launch ErrorException("Timeout -- the workers did not connect to the manager")
ERROR (unhandled task failure): Timeout -- the workers did not connect to the manager
ERROR: Timeout -- the workers did not connect to the manager
Stacktrace:
 [1] wait(::Task) at ./task.jl:184
 [2] #addprocs_locked#30(::Array{Any,1}, ::Function, ::MPI.MPIManager) at ./distributed/cluster.jl:361
 [3] #addprocs#29(::Array{Any,1}, ::Function, ::MPI.MPIManager) at ./distributed/cluster.jl:319
 [4] addprocs(::MPI.MPIManager) at ./distributed/cluster.jl:315

I executed this code on node2, and there is nothing wrong with the firewall settings or paths, because an MPI program written in C runs across these nodes without any error.
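For reference, a rough MPI.jl equivalent of that kind of cross-node sanity check (host names are placeholders), launched directly with mpiexec -np 2 -hosts node2,node3 julia hello.jl and without MPIManager, would be:

# hello.jl -- minimal MPI hello-world, no MPIManager involved
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)           # rank of this process
nprocs = MPI.Comm_size(comm)         # total number of processes
println("rank $rank of $nprocs running on $(gethostname())")
MPI.Finalize()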

I used CentOS 7, MPICH 3, Julia 0.6.0, and the same path on both nodes.

yeonsookimdev commented on Aug 03 '17, 15:08

I tested with mpirun_cmd = `mpiexec -np 2 -hosts localhost,node2` on node2 and it ran as expected.

julia> using MPI

julia> manager = MPIManager(np = 2, mpirun_cmd =
       `mpiexec -np 2 -hosts localhost,node2`)
MPI.MPIManager(np=2,launched=false,mode=MPI_ON_WORKERS)

julia> addprocs(manager)
2-element Array{Int64,1}:
 2
 3

julia> MPI.Init()

julia> @everywhere rank = MPI.Comm_rank(MPI.COMM_WORLD)

julia> @mpi_do println("id: $(myid()), rank: $(rank)")
ERROR: MethodError: no method matching @mpi_do(::Expr)
Closest candidates are:
  @mpi_do(::ANY, ::ANY) at /home/alkorang/.julia/v0.6/MPI/src/cman.jl:506

julia> @mpi_do manager println("id: $(myid()), rank: $(rank)")
        From worker 2:  id: 2, rank: 0
        From worker 3:  id: 3, rank: 1

julia>

I guess the problem is that the port number used by MPIManager is hard-coded in the source code: https://github.com/JuliaParallel/MPI.jl/blob/v0.6.0/src/cman.jl#L85-#L85
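For what it's worth, a quick way to check whether that fixed port is even reachable from a worker node is a plain TCP connect; the host and port below are placeholders, and the actual port is whatever is hard-coded at the linked line:

# Rough connectivity probe, run on the worker node (e.g. node3).
# On Julia 0.6 connect is in Base; `using Sockets` is only needed on 0.7+.
using Sockets

manager_host = "node2"   # host where addprocs(manager) is run
manager_port = 20000     # placeholder for the port hard-coded in cman.jl

try
    sock = connect(manager_host, manager_port)
    println("TCP connect to $manager_host:$manager_port succeeded")
    close(sock)
catch err
    # EHOSTUNREACH here would match the error in the stack trace above
    println("TCP connect failed: ", err)
end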

yeonsookimdev commented on Aug 03 '17, 23:08

Do we still use mpiexec? I am not sure, but I suspect things have been substantially improved or changed since this issue was opened.

Can we close it?

ViralBShah commented on May 25 '20, 20:05