possible bug in python/fedml/core/distributed/communication/trpc/utils.py

Open bene-ges opened this issue 1 year ago • 0 comments

Hi,

I was trying to launch federate/cross_silo/cuda_rpc_fedavg_mnist_lr_example, mapping all processes (1 server and 2 clients) to a single GPU.

It ended with this error:

File "/home/myhome/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 235, in _validate_device_maps
    raise ValueError(
ValueError: Node worker0 has target devices with invalid indices in its device map for worker2
device map = {device(type='cuda', index=0): device(type='cuda', index=2)}
device count = 1

I suspect there is a bug in python/fedml/core/distributed/communication/trpc/utils.py:

# Generate Device Map for Cuda RPC
def set_device_map(options, worker_idx, device_list):
    local_device = device_list[worker_idx]
    for index, remote_device in enumerate(device_list):
        logging.warn(f"Setting device map for client {index} as {remote_device}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})

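Here device_list is a dict {0: 0, 1: 0, 2: 0}, but enumerate iterates over its keys, so the key (0, 1, 2) is used as remote_device instead of the device it maps to. That matches the error above, where worker0 maps cuda:0 to cuda:2 for worker2.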
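A minimal sketch of the iteration behaviour in question, using the same dict as in this report. It needs neither torch nor FedML, and the variable names `pairs_from_enumerate` and `pairs_from_items` are only illustrative:

```python
# Process index -> GPU index, as described in this report
device_list = {0: 0, 1: 0, 2: 0}

# enumerate() over a dict walks its keys, so the second element is the key
pairs_from_enumerate = [(index, remote_device) for index, remote_device in enumerate(device_list)]

# .items() yields the GPU index that each process is mapped to
pairs_from_items = list(device_list.items())

print(pairs_from_enumerate)  # [(0, 0), (1, 1), (2, 2)]
print(pairs_from_items)      # [(0, 0), (1, 0), (2, 0)]
```

With a single GPU, only index 0 is valid, so the keys 1 and 2 from the first form are out of range as device indices.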

I tried to correct this as follows:

    for index, remote_device in enumerate(device_list):
        logging.warn(f"Setting device map for client {index} as {device_list[remote_device]}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: device_list[remote_device]})

With this change the example ran without the error.
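One possible variant of the same idea, sketched under assumptions: it iterates with `.items()` so the process index and its device come from the dict directly, and it also accepts a plain list. `RecordingOptions` is a hypothetical stand-in that only records calls (it is not the real RPC options class), and the `WORKER_NAME` format is guessed from the `worker0` / `worker2` names in the traceback. This is not a tested patch for FedML.

```python
import logging

# Assumed format, inferred from "worker0" / "worker2" in the traceback
WORKER_NAME = "worker{}"


class RecordingOptions:
    """Hypothetical stand-in for the RPC backend options; it only records calls."""

    def __init__(self):
        self.device_maps = {}

    def set_device_map(self, to, device_map):
        self.device_maps[to] = device_map


def set_device_map(options, worker_idx, device_list):
    # Accept either a dict {process index: device} or a plain list of devices
    if isinstance(device_list, dict):
        mapping = device_list
    else:
        mapping = dict(enumerate(device_list))

    local_device = mapping[worker_idx]
    for index, remote_device in mapping.items():
        logging.warning("Setting device map for client %s as %s", index, remote_device)
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})


options = RecordingOptions()
set_device_map(options, 0, {0: 0, 1: 0, 2: 0})
print(options.device_maps)  # {'worker1': {0: 0}, 'worker2': {0: 0}}
```

Unlike the enumerate-based correction, this does not rely on the dict keys being exactly 0, 1, 2 in insertion order.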

bene-ges, Mar 29 '24 14:03