Support PyTorch all_reduce in DDP using SMDDP so we can parallelize validation
Please support all_reduce on tensors through the PyTorch DDP API when using the SMDDP backend, so that we can run validation in parallel across ranks rather than on a single rank.
dist.all_reduce(loss, op=ReduceOp.AVG)  # averages loss across ranks in place
Results in:
RuntimeError: SMDDP does not support: ReduceOp
This would let us avoid restricting validation to a single rank, as in this example:
https://github.com/aws/amazon-sagemaker-examples/blob/c266495f4a4b8e9c65f288edad5f0729c5ca3959/training/distributed_training/pytorch/data_parallel/mnist/code/train_pytorch_smdataparallel_mnist.py#L255
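In the meantime, a possible workaround (a sketch, assuming SMDDP accepts the default SUM op even though it rejects ReduceOp.AVG; the helper name is hypothetical) is to all-reduce with SUM and divide by the world size ourselves:

```python
import torch
import torch.distributed as dist

def all_reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
    """Average `tensor` across all ranks without using ReduceOp.AVG.

    Assumption: the backend (e.g. SMDDP) supports the default SUM op.
    We all-reduce with SUM in place, then divide by the world size to
    recover the mean. Falls through unchanged in a single-process run.
    """
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # in-place SUM across ranks
        tensor /= dist.get_world_size()                # SUM / N = mean
    return tensor
```

With this, the validation snippet above becomes `all_reduce_mean(loss)` instead of `dist.all_reduce(loss, op=ReduceOp.AVG)`.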
Hi, may I ask a follow-up question? I am following the DDP (Distributed Data Parallel) guidance (Guide 1, Guide 2) to deploy my deep learning models to AWS SageMaker Training Jobs. However, when running them, I encounter the same error as yours: RuntimeError: SMDDP does not support: ReduceOp
May I ask if there is any quick fix to this issue? For example, do I need to modify the training_step/validation_step?
def training_step(self, batch, batch_idx):
    img1, img2, labels = batch
    feat1, feat2 = self(img1, img2)
    loss = self.compute_loss(feat1, feat2, labels)
    self.log("train_loss", loss, prog_bar=True, on_epoch=True, sync_dist=True, logger=True, on_step=False)
    return loss

def validation_step(self, batch, batch_idx):
    img1, img2, labels = batch
    feat1, feat2 = self(img1, img2)
    loss = self.compute_loss(feat1, feat2, labels)
    self.log("val_loss", loss, prog_bar=True, on_epoch=True, sync_dist=True, logger=True, on_step=False)
    return loss
Or anything I need to change in the DDP settings?
# Set up DDP on SageMaker
ddp = DDPStrategy(
    cluster_environment=env,
    process_group_backend="smddp",
    accelerator="gpu",
)

# Initialize the PyTorch Lightning Trainer
trainer = pl.Trainer(
    max_epochs=args.epochs,
    strategy=ddp,                        # Distributed Data Parallel strategy
    devices=torch.cuda.device_count(),   # Use all available GPUs
    precision=16,                        # Use mixed precision (16-bit)
    callbacks=[checkpoint_callback, early_stopping_callback],
    log_every_n_steps=10,
    logger=csv_logger,
)

# Train the model
trainer.fit(model, datamodule=data_module)
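For reference, one workaround I am considering (a sketch only, assuming the error is triggered by the reduction behind sync_dist=True and that SMDDP accepts the default SUM op; the mixin name is hypothetical): drop sync_dist=True for the validation loss and aggregate it by hand once per epoch with a SUM all-reduce, dividing by the world size to get the mean.

```python
import torch
import torch.distributed as dist

class ManualValLossMixin:
    """Hypothetical mixin for a LightningModule: aggregate val_loss by hand.

    Instead of self.log(..., sync_dist=True), collect per-batch losses and
    perform a single SUM all-reduce per epoch, then divide by world size.
    Assumes SMDDP supports ReduceOp.SUM even though it rejects AVG.
    """

    def on_validation_epoch_start(self):
        self._val_losses = []

    def validation_step(self, batch, batch_idx):
        img1, img2, labels = batch
        feat1, feat2 = self(img1, img2)
        loss = self.compute_loss(feat1, feat2, labels)
        self._val_losses.append(loss.detach())
        return loss

    def on_validation_epoch_end(self):
        # Local mean over this rank's validation batches.
        epoch_loss = torch.stack(self._val_losses).mean()
        if dist.is_available() and dist.is_initialized():
            # SUM across ranks, then divide by world size to get the mean.
            dist.all_reduce(epoch_loss, op=dist.ReduceOp.SUM)
            epoch_loss /= dist.get_world_size()
        self.log("val_loss", epoch_loss, prog_bar=True)
```

Note this averages the per-rank means, so it is only an exact global mean when every rank sees the same number of validation batches.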
Thank you so much in advance!