
Nondeterministic behavior of `addprocs()` with `SSHManager`

Open: JonnyKong opened this issue 2 years ago • 1 comment

Given an array of node addresses as input, `addprocs()` returns an array of the launched worker PIDs. However, the returned PIDs do not necessarily follow the order of the input addresses.

For example, after `(p1, p2) = addprocs([machine1, machine2])`, `p1` may be running on `machine1` and `p2` on `machine2`, or the other way around.

The cause of this nondeterministic behavior is that `launch(manager::SSHManager, ...)` launches workers in parallel. As each worker comes up, it is pushed to `launched` in completion order, with no synchronization or reordering to match the input:

https://github.com/JuliaLang/Distributed.jl/blob/fd9d120e90a31a9c50aa7a360a108227aa55f212/src/managers.jl#L177-L185
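
To illustrate the effect without any SSH setup, here is a minimal simulation of the pattern described above. It is not the code from `managers.jl`: `simulated_launch` and the `delays` argument are hypothetical names, and `sleep` merely stands in for connection latency. Each task pushes its input index to a shared vector when it finishes, so the result reflects completion order rather than input order.

```julia
# Simulation only: each "worker" takes a different time to come up and
# records its input index in a shared vector once it is ready.
function simulated_launch(delays::Vector{Float64})
    launched = Int[]
    @sync for (i, d) in enumerate(delays)
        @async begin
            sleep(d)            # stands in for SSH connection latency
            push!(launched, i)  # appended in completion order, not input order
        end
    end
    return launched
end

order = simulated_launch([0.3, 0.1, 0.2])
println(order)  # a permutation of 1:3; slower "hosts" tend to appear later
```

Every index is present in the result, but its position depends on timing, which is the same reason the PIDs returned by `addprocs()` need not line up with the input addresses.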

While this is not a bug, the nondeterministic ordering is counter-intuitive and error-prone.
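
Until the ordering is guaranteed, one possible workaround is to avoid relying on position at all. The sketch below shows two options; `host_by_pid` and `addprocs_ordered` are hypothetical helpers, not part of Distributed.jl. The first asks every worker which host it is running on, and the second adds machines one at a time so the PIDs are collected in input order, at the cost of launching sequentially instead of in parallel.

```julia
using Distributed

# Hypothetical helper: ask each worker for the hostname it reports.
host_by_pid(pids) = Dict(p => remotecall_fetch(gethostname, p) for p in pids)

# Hypothetical helper: add machines one by one so that the returned PIDs
# follow the order of `machines`. Slower than a single parallel addprocs call.
function addprocs_ordered(machines; kwargs...)
    pids = Int[]
    for m in machines
        append!(pids, addprocs([m]; kwargs...))
    end
    return pids
end

# Example use (assumes machine1 and machine2 are reachable over SSH):
# pids  = addprocs([machine1, machine2])
# hosts = host_by_pid(pids)   # look up where each PID actually runs
```

Note that `host_by_pid` returns whatever `gethostname()` reports on the worker, which may differ from the address string passed to `addprocs()`.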

JonnyKong, Oct 27 '23 00:10

Reopening this issue (was closed by mistake).

JonnyKong, Oct 27 '23 01:10