Problems with distributed computing in federated learning
When running distributed training, I have four GPUs, each hosting one client. During training, memory usage differs hugely across the GPUs, and two of them even ran out of memory. I also found that training on the GPUs that overflowed was extremely slow, with GPU utilization close to zero.
```python
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
    images = list(image.to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    with torch.cuda.amp.autocast(enabled=scaler is not None):
        # print('1:', device, torch.cuda.memory_allocated() / (1024 * 1024))
        loss_dict = model(images, targets)  # this line runs out of memory
        # print('2:', device, torch.cuda.memory_allocated() / (1024 * 1024))
        losses = sum(loss for loss in loss_dict.values())
```
What can I do to make the other GPUs train properly?
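For anyone debugging a similar imbalance: a quick way to confirm whether the four client processes really sit on four different GPUs is to log the active device and its memory from inside each process. A minimal sketch, assuming one process per client; `log_gpu_state` is a hypothetical helper, not part of the script above:

```python
import os
import torch

def log_gpu_state(tag: str) -> None:
    """Print which GPU this process is using and how much memory it holds.

    Hypothetical debugging helper, not part of the training script.
    """
    dev = torch.cuda.current_device()
    alloc_mb = torch.cuda.memory_allocated(dev) / (1024 * 1024)
    reserved_mb = torch.cuda.memory_reserved(dev) / (1024 * 1024)
    print(f"[pid {os.getpid()}] {tag}: cuda:{dev} "
          f"allocated={alloc_mb:.0f} MiB reserved={reserved_mb:.0f} MiB")

# Call once per client process, e.g. before and after a forward pass.
# If several processes report cuda:0, the clients are sharing one GPU,
# which would explain out-of-memory errors on some devices.
log_gpu_state("before forward")
```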
@rG223 Hi, may we access your environment? Or could you specify which source code you are using? We will have engineers help you.
@rG223 Are you still working on this issue? Here are examples to get you started with distributed training: https://github.com/FedML-AI/FedML/tree/master/python/examples
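One common cause of the imbalance described above is that every client process allocates on cuda:0 by default. A minimal sketch of pinning each process to its own GPU in plain PyTorch (a generic pattern, not FedML's own API; the `LOCAL_RANK` variable is set by launchers such as torchrun):

```python
import os
import torch

def main() -> None:
    # One process per client; launchers such as torchrun set LOCAL_RANK
    # to 0..3 when started with --nproc_per_node=4.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # Pin this process to its own GPU before creating the model or moving
    # any data. Without this, every process allocates on cuda:0, which can
    # run out of memory while the remaining GPUs sit almost idle.
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Stand-in model; the real detection model would be built here.
    model = torch.nn.Linear(10, 2).to(device)

    # ... per-client training loop (e.g. the loop posted above) ...

if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=4 client.py`, each of the four processes then allocates only on its own GPU.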
Closing due to inactivity.