Rui Pan 潘瑞
Hey @ymjiang, thanks for the info! However, after switching to benchmark_byteps.py, the issue is still there. FYI:
* Here's the updated [Dockerfile](https://gist.github.com/ruipeterpan/f64ea87e5974b15a3b07bc93e50b0719); the only difference from the official one...
@vycezhong Thanks for your help. I double-checked to make sure no bps-related processes are alive (both inside and outside of all containers) before launching the server, yet it still crashes....
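For reference, here's roughly the check I run before each launch (just a sketch using psutil; the name patterns are whatever my launch scripts happen to contain, so adjust them to your setup):

```python
import psutil

# Process-name patterns I look for; adjust to whatever your launch scripts use.
PATTERNS = ("bpslaunch", "byteps", "bps_issue", "benchmark_byteps")

def find_bps_processes():
    """Return any live processes whose name/cmdline mentions BytePS."""
    hits = []
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        try:
            cmd = " ".join(proc.info["cmdline"] or []) + " " + (proc.info["name"] or "")
            if any(p in cmd for p in PATTERNS):
                hits.append((proc.info["pid"], cmd.strip()))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return hits

if __name__ == "__main__":
    leftovers = find_bps_processes()
    if leftovers:
        for pid, cmd in leftovers:
            print(f"still alive: pid={pid} cmd={cmd}")
    else:
        print("no bps-related processes found")
```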
@vycezhong If only the server gets launched, it starts the ZMQ recv thread and waits without any errors. As soon as the workers are launched, the server crashes.
@vycezhong Thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior in the training loss curve after applying the changes in the PR....
> @ruipeterpan You also need to enable async for workers.

I had already toggled BYTEPS_ENABLE_ASYNC on all workers, servers, and the scheduler for both the async and sync runs.
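To be concrete, this is roughly how the variable gets exported to each role in my setup (a sketch only; the address, port, and worker count are placeholders, and in my actual runs each role lives in its own container rather than being spawned from one launcher):

```python
import os
import subprocess

# Common env for every BytePS role. BYTEPS_ENABLE_ASYNC=1 switches on asynchronous
# training and has to be visible to the workers, the servers, AND the scheduler.
base_env = dict(
    os.environ,
    BYTEPS_ENABLE_ASYNC="1",        # unset (or "0") for the sync runs
    DMLC_NUM_WORKER="4",
    DMLC_NUM_SERVER="1",
    DMLC_PS_ROOT_URI="10.0.0.1",    # placeholder scheduler IP
    DMLC_PS_ROOT_PORT="1234",       # placeholder scheduler port
)

def launch(role, extra_env=None, cmd=("bpslaunch",)):
    """Spawn one BytePS role with the shared env plus role-specific variables."""
    env = dict(base_env, DMLC_ROLE=role, **(extra_env or {}))
    return subprocess.Popen(list(cmd), env=env)

# Scheduler and server only need `bpslaunch`; workers also run the training script.
procs = [launch("scheduler"), launch("server")]
procs += [
    launch("worker", {"DMLC_WORKER_ID": str(i)},
           cmd=("bpslaunch", "python3", "bps_issue.py"))
    for i in range(4)
]
for p in procs:
    p.wait()
```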
@ymjiang Here's the loss curve I got for both sync and async using v0.2.4 (809ef20):
@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110 and the original script (bps_issue.py) I provided. I then commented out the `metric_average()` call on the training loss after each epoch ([this part](https://gist.github.com/ruipeterpan/70ac2dc7c72edcb2995130c5b83fb96a#file-bps_issue-py-L172-L176)), and...
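For context, `metric_average()` in bps_issue.py is basically the Horovod-style helper ported to BytePS's `push_pull`, roughly like this (a sketch, not the exact gist code):

```python
import torch
import byteps.torch as bps

def metric_average(val, name):
    # Average a scalar metric (e.g. the epoch training loss) across workers
    # by doing a push_pull on a one-element tensor.
    tensor = torch.tensor(val)
    avg_tensor = bps.push_pull(tensor, name=name)
    return avg_tensor.item()
```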
@ymjiang Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110. The default lr is 0.05, and the loss curve was still going up after I lowered it to 0.0125. I also tried out some...
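FWIW, 0.0125 is the default 0.05 divided by 4, i.e. scaling the lr down by the worker count. Here's a sketch of that kind of scaling (the model is just a placeholder, and I'm assuming `bps.size()` returns the number of workers):

```python
import torch
import byteps.torch as bps

bps.init()

base_lr = 0.05                    # the default lr (0.05) mentioned above
scaled_lr = base_lr / bps.size()  # e.g. with 4 workers: 0.05 / 4 = 0.0125

# Placeholder model, just to show where the scaled lr is plugged in.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())
```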
@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch in...