Rui Pan 潘瑞

11 comments by Rui Pan 潘瑞

Hey @ymjiang, thanks for the info! Even after switching to `benchmark_byteps.py`, the issue is still there. FYI:

* Here's the updated [Dockerfile](https://gist.github.com/ruipeterpan/f64ea87e5974b15a3b07bc93e50b0719); the only difference from the official one...

@vycezhong Thanks for your help. I double-checked to make sure no bps-related processes are alive (both inside and outside of all containers) before launching the server, yet it still crashes....

@vycezhong If only the server is launched, it starts the ZMQ recv thread and waits without error. As soon as the workers are launched, the server crashes.

@vycezhong thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior for the training loss curve after applying the changes in the PR....

> @ruipeterpan You also need to enable async for workers.

I had already toggled `BYTEPS_ENABLE_ASYNC` for all workers, servers & the scheduler for both async mode and sync mode.

@ymjiang Here's the loss curve I got for both sync and async using v0.2.4 (809ef20)

@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110 and the original scripts (bps_issue.py) I provided: Then I commented out a `metric_average()` on the training loss after each epoch ([this part](https://gist.github.com/ruipeterpan/70ac2dc7c72edcb2995130c5b83fb96a#file-bps_issue-py-L172-L176)), and...

@ymjiang Here's what I got using https://github.com/bytedance/byteps/commit/7ac1dc74335b8935e4ac897e8d92d9c563fdf110. The default lr is 0.05, and the loss curve was still going up after I set the lr to 0.0125. I also tried out some...

@vycezhong Here's what I got using https://github.com/bytedance/byteps/commit/18699f8932e404c0a8c97f847c1c06e0b4ec1fdf with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch in...