Problems with distributed computing in federated learning
When running distributed training, I have four GPUs, each hosting one client. During training, memory usage differs hugely across the GPUs, and two of them even ran out of memory. I also found that training on the GPUs that overflowed was extremely slow, with GPU utilization close to zero.
```python
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
    images = list(image.to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    with torch.cuda.amp.autocast(enabled=scaler is not None):
        # print('1:', device, torch.cuda.memory_allocated() / (1024 * 1024))
        loss_dict = model(images, targets)  # this line runs out of memory
        # print('2:', device, torch.cuda.memory_allocated() / (1024 * 1024))
        losses = sum(loss for loss in loss_dict.values())
```
What can I do to make the other GPUs train properly?
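For anyone debugging a similar imbalance: a quick way to confirm whether the four client processes really sit on four different GPUs is to log the active device and its memory from inside each process. A minimal sketch, assuming one process per client; `log_gpu_state` is a hypothetical helper, not part of the script above:

```python
import os
import torch

def log_gpu_state(tag: str) -> None:
    """Print which GPU this process is using and how much memory it holds.

    Hypothetical debugging helper, not part of the training script.
    """
    dev = torch.cuda.current_device()
    alloc_mb = torch.cuda.memory_allocated(dev) / (1024 * 1024)
    reserved_mb = torch.cuda.memory_reserved(dev) / (1024 * 1024)
    print(f"[pid {os.getpid()}] {tag}: cuda:{dev} "
          f"allocated={alloc_mb:.0f} MiB reserved={reserved_mb:.0f} MiB")

# Call once per client process, e.g. before and after a forward pass.
# If several processes report cuda:0, the clients are sharing one GPU,
# which would explain out-of-memory errors on some devices.
log_gpu_state("before forward")
```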
@rG223 Hi, may we access your environment? Or could you specify which source code you are using? We will have engineers help you.
@rG223 Are you still working on this issue? Here are examples to get you started with distributed training: https://github.com/FedML-AI/FedML/tree/master/python/examples
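One common cause of the imbalance described above is that every client process allocates on cuda:0 by default. A minimal sketch of pinning each process to its own GPU in plain PyTorch (a generic pattern, not FedML's own API; the `LOCAL_RANK` variable is set by launchers such as torchrun):

```python
import os
import torch

def main() -> None:
    # One process per client; launchers such as torchrun set LOCAL_RANK
    # to 0..3 when started with --nproc_per_node=4.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # Pin this process to its own GPU before creating the model or moving
    # any data. Without this, every process allocates on cuda:0, which can
    # run out of memory while the remaining GPUs sit almost idle.
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Stand-in model; the real detection model would be built here.
    model = torch.nn.Linear(10, 2).to(device)

    # ... per-client training loop (e.g. the loop posted above) ...

if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=4 client.py`, each of the four processes then allocates only on its own GPU.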
Closing due to inactivity.