CoLLiE

fix(dist_utils): fix port conflict in setup_distribution

Open gyt1145028706 opened this issue 1 year ago • 1 comments

If the port conflicts, find an unoccupied port and update os.environ["MASTER_PORT"] accordingly.
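A minimal sketch of the behaviour described here, not the PR's actual code. The helper names `can_bind`, `find_free_port` and `resolve_master_port`, the upward linear scan, and the fallback default of 29500 are all assumptions made for illustration; only `os.environ["MASTER_PORT"]` comes from the description.

```python
import os
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port(host: str, start: int, max_tries: int = 100) -> int:
    """Hypothetical helper: scan upward from `start` for a bindable port."""
    for port in range(start, min(start + max_tries, 65536)):
        if can_bind(host, port):
            return port
    raise RuntimeError(f"no free port found starting from {start}")


def resolve_master_port(host: str = "127.0.0.1") -> int:
    """Keep MASTER_PORT if it is free, otherwise move it to a free port."""
    # 29500 is an assumed fallback, used only when MASTER_PORT is unset.
    port = int(os.environ.get("MASTER_PORT", "29500"))
    if not can_bind(host, port):
        port = find_free_port(host, port + 1)
        os.environ["MASTER_PORT"] = str(port)
    return port
```

Note that the result depends on which ports happen to be free at the moment each process runs the check, so separate processes are not guaranteed to agree on the value.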

gyt1145028706 • May 02 '24 12:05

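With this approach, each rank may end up with a different master_port, depending on the order in which the ranks run the check. It would be better to do what torchrun does: raise an error, terminate the program, and prompt the user to change the environment variable.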
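The fail-fast alternative suggested here could look roughly like the sketch below. `check_master_port` is a hypothetical name, and the error text is illustrative rather than torchrun's actual message. It assumes the check runs once, in the process that will host the rendezvous, before the process group is initialized.

```python
import os
import socket


def check_master_port(host: str = "127.0.0.1") -> int:
    """Raise instead of silently switching to another port.

    MASTER_PORT is never rewritten, so every rank keeps reading the
    same value from its environment.
    """
    port = int(os.environ["MASTER_PORT"])
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            raise RuntimeError(
                f"MASTER_PORT={port} is already in use on {host}; "
                "export a different MASTER_PORT and relaunch."
            ) from exc
    return port
```

Because nothing is modified on failure, the ranks cannot drift apart: either all of them proceed with the exported port, or the launch stops with an actionable message.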

KaiLv69 • May 06 '24 07:05

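With bind, a port that is actually available can be judged unavailable.

For example, after I exported a new port as prompted, the next run still detected port used as False. After switching to connect, the problem went away.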

---Original---
Date: Thu, May 9, 2024 11:26 AM
Subject: Re: [OpenMOSS/CoLLiE] fix(dist_utils): fix port conflict in setup_distribution (PR #178)

@KaiLv69 commented on this pull request.

In collie/utils/dist_utils.py:

> @@ -167,6 +168,24 @@ def _decompose_slurm_nodes(s):
>      return results
>
> +def port_used(host: str, port: int) -> bool:
> +    "Check whether the port is in use"
> +    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
> +        try:
> +            s.connect((host, port))  # try to bind to the local address and the specified port

Why was this changed from bind to connect?

gyt1145028706 • May 09 '24 03:05

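Exporting an environment variable does not by itself occupy a port, so s.connect does not seem appropriate here. You may want to look at the difference between connect and bind.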
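To make the difference concrete, here is a small sketch comparing the two probes. The function names `in_use_by_bind` and `in_use_by_connect` are hypothetical and are not taken from the PR. A bind probe asks "can I claim this port myself?", while a connect probe asks "is something already accepting connections on it?".

```python
import socket


def in_use_by_bind(host: str, port: int) -> bool:
    """Port counts as used if we cannot bind it ourselves."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


def in_use_by_connect(host: str, port: int) -> bool:
    """Port counts as used only if something accepts connections on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


# A socket that is bound but not yet listening: the port is taken,
# but nothing accepts connections on it yet.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
port = holder.getsockname()[1]
bound_only = (in_use_by_bind("127.0.0.1", port),
              in_use_by_connect("127.0.0.1", port))

# Once the socket listens, both probes see the port as used.
holder.listen(1)
listening = (in_use_by_bind("127.0.0.1", port),
             in_use_by_connect("127.0.0.1", port))
holder.close()
```

In the bound-but-not-listening state, the bind probe reports the port as in use while the connect probe reports it as free. A connect-based check can therefore miss a port that another process has already claimed but is not yet serving. It also says nothing about a port that was merely exported in MASTER_PORT, since exporting the variable opens no socket at all.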

KaiLv69 avatar May 09 '24 03:05 KaiLv69