Diffusion-based-Segmentation
Error when training on multiple GPUs

Open CNGaoWenbo opened this issue 1 year ago • 1 comment

I launched multi-GPU training with torchrun, but it hangs here:

Setting up a new session... Setting up a new session... Setting up a new session...

Does anyone have an idea? Thanks.

CNGaoWenbo avatar Mar 18 '24 14:03 CNGaoWenbo

I only changed `dist_util`:

```python
GPUS_PER_NODE = 4  # change to 4

SETUP_RETRY_COUNT = 3

def setup_dist():
    if dist.is_initialized():
        return
    os.environ["CUDA_VISIBLE_DEVICES"] = '6,7,8,9'  # change to '6,7,8,9'

    backend = "gloo" if not th.cuda.is_available() else "nccl"

    if backend == "gloo":
        hostname = "localhost"
    else:
        hostname = socket.gethostbyname(socket.getfqdn())
    os.environ["MASTER_ADDR"] = '127.0.1.1'  # comm.bcast(hostname, root=0)
    os.environ["RANK"] = '0'  # str(comm.rank)
    os.environ["WORLD_SIZE"] = '4'  # change to 4

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    s.listen(1)
    port = s.getsockname()[1]
    s.close()
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend=backend, init_method="env://")
```

CNGaoWenbo avatar Mar 18 '24 14:03 CNGaoWenbo
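
One possible cause, not confirmed in this thread: in the `setup_dist()` posted above, every process sets `RANK` to `'0'` while `WORLD_SIZE` is `'4'`, and each process binds its own free port for `MASTER_PORT`. The four workers therefore cannot agree on a rendezvous point, which would make `dist.init_process_group` wait indefinitely. `torchrun` already exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` for each worker, so a minimal sketch could simply read them instead of overwriting them. The helper name `read_torchrun_env` is hypothetical, and the rest of the repository's `dist_util` is assumed to be unchanged.

```python
import os


def read_torchrun_env(env=None):
    """Return (rank, local_rank, world_size) from the variables torchrun exports.

    `read_torchrun_env` is a hypothetical helper, not part of the repository.
    """
    env = os.environ if env is None else env
    required = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"not launched with torchrun? missing: {missing}")
    return int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"])


def setup_dist():
    """Initialize the process group without overriding torchrun's settings."""
    # Imported here so the helper above can be used without torch installed.
    import torch as th
    import torch.distributed as dist

    if dist.is_initialized():
        return

    rank, local_rank, world_size = read_torchrun_env()
    backend = "nccl" if th.cuda.is_available() else "gloo"
    if th.cuda.is_available():
        # Pin each worker to its own GPU among the visible devices.
        th.cuda.set_device(local_rank)

    # "env://" reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the
    # environment, so all workers share the address and port torchrun chose.
    dist.init_process_group(backend=backend, init_method="env://")
```

With this sketch, the GPU selection would move to the launch command rather than being set inside `setup_dist()`, for example `CUDA_VISIBLE_DEVICES=6,7,8,9 torchrun --nproc_per_node=4` followed by the training script and its arguments.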