amazon-sagemaker-examples

Support PyTorch all_reduce in DDP using SMDDP so we can parallelize validation

Open CountingClouds opened this issue 2 years ago • 1 comment

Please support all_reduce on tensors through the Data Parallel DDP API when using the SMDDP backend, so that we can run validation in parallel rather than on a single rank.

validation_loss = dist.all_reduce(loss, op=ReduceOp.AVG)

Results in:

RuntimeError: SMDDP does not support: ReduceOp

This would let us avoid restricting validation to a single rank, as is done here:

https://github.com/aws/amazon-sagemaker-examples/blob/c266495f4a4b8e9c65f288edad5f0729c5ca3959/training/distributed_training/pytorch/data_parallel/mnist/code/train_pytorch_smdataparallel_mnist.py#L255
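In the meantime, a possible workaround (unverified, and assuming SMDDP supports the default SUM reduction) is to average the loss manually: all_reduce with ReduceOp.SUM, then divide by the world size. The helper name average_across_ranks below is purely illustrative:

import torch
import torch.distributed as dist

def average_across_ranks(loss: torch.Tensor) -> torch.Tensor:
    # all_reduce is in-place and returns None (unless async_op=True),
    # so reduce a detached copy and divide by the number of ranks.
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()
    return reduced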

CountingClouds avatar Apr 28 '23 22:04 CountingClouds

Hi, may I ask a follow-up question? I am trying to follow the DDP (Distributed Data Parallel) guidance (Guide 1, Guide 2) to train my deep learning models on AWS SageMaker Training Jobs. However, when running the job, I hit the same error as you: RuntimeError: SMDDP does not support: ReduceOp

May I ask if there is any quick fix to this issue? For example, do I need to modify the training_step/validation_step?

def training_step(self, batch, batch_idx):
    img1, img2, labels = batch
    feat1, feat2 = self(img1, img2)
    loss = self.compute_loss(feat1, feat2, labels)
    self.log("train_loss", loss, prog_bar=True, on_epoch=True, sync_dist=True, logger=True, on_step=False)
    return loss

def validation_step(self, batch, batch_idx):
    img1, img2, labels = batch
    feat1, feat2 = self(img1, img2)
    loss = self.compute_loss(feat1, feat2, labels)
    self.log("val_loss", loss, prog_bar=True, on_epoch=True, sync_dist=True, logger=True, on_step=False)
    return loss
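
One possible modification (a sketch only, not verified against SMDDP): sync_dist=True asks Lightning to reduce the metric across ranks with a mean, which may be what triggers the unsupported ReduceOp, so you could drop sync_dist and do the cross-rank averaging yourself with a SUM all_reduce:

import torch.distributed as dist

def validation_step(self, batch, batch_idx):
    img1, img2, labels = batch
    feat1, feat2 = self(img1, img2)
    loss = self.compute_loss(feat1, feat2, labels)
    # Manual SUM all_reduce plus division, instead of sync_dist=True (illustrative only)
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()
    self.log("val_loss", reduced, prog_bar=True, on_epoch=True, logger=True, on_step=False)
    return loss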

Or anything I need to change in the DDP settings?

# Set up DDP on SageMaker
ddp = DDPStrategy(
    cluster_environment=env,
    process_group_backend="smddp",
    accelerator="gpu",
)

# Initialize the PyTorch Lightning Trainer
trainer = pl.Trainer(
    max_epochs=args.epochs,
    strategy=ddp,                          # Distributed Data Parallel strategy
    devices=torch.cuda.device_count(),     # Use all available GPUs
    precision=16,                          # Use mixed precision (16-bit)
    callbacks=[checkpoint_callback, early_stopping_callback],
    log_every_n_steps=10,
    logger=csv_logger,
)

# Train the model
trainer.fit(model, datamodule=data_module)
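
For completeness, here is a sketch of the surrounding setup the snippet above assumes (the env object and the backend registration), based on the SageMaker data-parallel + Lightning guides; the exact import paths are assumptions, so please check them against the library versions in your container:

import os
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" process-group backend
from pytorch_lightning.plugins.environments import LightningEnvironment

# Let Lightning pick up the world size and rank that SageMaker sets in the environment
env = LightningEnvironment()
env.world_size = lambda: int(os.environ["WORLD_SIZE"])
env.global_rank = lambda: int(os.environ["RANK"])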

Thank you so much in advance!

ZihanChen1995 avatar Nov 09 '24 02:11 ZihanChen1995