parameter_server

System hangs when a server is killed

Open DanishKhan14 opened this issue 9 years ago • 0 comments

The system hangs whenever a server process is killed, whether on the local machine or on a remote machine: the node running the scheduler stops printing iteration results to the terminal, and no further logs are generated. Is this the expected behaviour? Shouldn't training continue, with gradients being applied on the backup/replicated server node, as described in the paper?

Here are the steps that I ran (from "parameter_server/example/linear" dir):

../../script/ps.sh start -nw 4 -ns 3 -hostfile hostfile ../../build/linear -app_file ctr/online_l1lr.conf -num_replicas 2 -report_interval 1

Then I killed a server process on one of the nodes, which stops the system. Killing a worker node, by contrast, does not stop it: SGD continues and eventually converges.
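To make the behaviour I expected concrete, here is a toy sketch in Python. It is not the parameter_server code or its API: `ToyServer`, `ToyCluster`, `push` and `pull` are hypothetical names, the key-to-server mapping is a plain modulo, and I am assuming that `-num_replicas 2` means two extra copies of each key on neighbouring servers. It only illustrates the idea that updates and reads should fall through to a live replica when the primary server is gone.

```python
# Toy model only: hypothetical names, NOT the parameter_server API.
# Assumption: num_replicas counts the extra copies kept besides the primary.

class ToyServer:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.weights = {}  # key -> current weight value


class ToyCluster:
    def __init__(self, num_servers, num_replicas):
        self.servers = [ToyServer(f"S{i}") for i in range(num_servers)]
        self.num_replicas = num_replicas

    def _copies(self, key):
        # The primary is picked by key; the following servers hold the replicas.
        n = len(self.servers)
        count = min(self.num_replicas + 1, n)
        return [self.servers[(key + i) % n] for i in range(count)]

    def push(self, key, grad, lr=0.1):
        # Apply the gradient step on every copy that is still alive.
        live = [s for s in self._copies(key) if s.alive]
        if not live:
            raise RuntimeError(f"no live copy of key {key}")
        for s in live:
            s.weights[key] = s.weights.get(key, 0.0) - lr * grad

    def pull(self, key):
        # Read from the first live copy, primary first.
        for s in self._copies(key):
            if s.alive:
                return s.weights.get(key, 0.0)
        raise RuntimeError(f"no live copy of key {key}")


if __name__ == "__main__":
    cluster = ToyCluster(num_servers=3, num_replicas=2)
    cluster.push(0, grad=1.0)
    cluster.servers[0].alive = False   # simulate killing the primary for key 0
    cluster.push(0, grad=1.0)          # still succeeds via the replicas
    print(cluster.pull(0))
```

In this toy model, killing `S0` does not block further pushes or pulls for key 0, which is the behaviour I was expecting from the real system after killing one server.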

Any help in this regard will be highly appreciated.

Thanks, Danish

DanishKhan14, Dec 14 '16 22:12