[bug] test_cpu_nccl SM local test failure on PyTorch CPU image

Open tejaschumbalkar opened this issue 4 years ago • 0 comments

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [ ] (If applicable) I've attached the script to reproduce the bug
  • [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description:

The test_cpu_nccl SM local test is marked green by the pytest run, but the test actually fails with:

RuntimeError: Distributed package doesn't have NCCL built in

We do not install NCCL on the PyTorch CPU images. Also, the test originates in sagemaker-pytorch-training-toolkit.

We need to confirm if the test is valid in the first place.
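A quick way to check the premise on a given image is to ask PyTorch whether its distributed package was built with NCCL. The snippet below is a minimal sketch, not the test from the toolkit. The helper names `nccl_available` and `pick_backend` are hypothetical, and falling back to `gloo` is an assumption about what a CPU-only test would want.

```python
def nccl_available() -> bool:
    """Return True only if torch imports and its distributed package has NCCL."""
    try:
        import torch.distributed as dist
    except ImportError:
        # torch is not installed in this environment
        return False
    # is_nccl_available() is only defined when the distributed package is available,
    # so check is_available() first.
    return bool(dist.is_available() and dist.is_nccl_available())


def pick_backend(has_nccl: bool) -> str:
    """Hypothetical helper: choose a process-group backend name."""
    return "nccl" if has_nccl else "gloo"


if __name__ == "__main__":
    has_nccl = nccl_available()
    print(f"NCCL available: {has_nccl}; backend to use: {pick_backend(has_nccl)}")
```

If this reports that NCCL is unavailable inside a PyTorch CPU image, a test that initializes the `nccl` backend there cannot pass, which would support treating the test as invalid for CPU images.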

DLC image/dockerfile:

All PyTorch CPU images

tejaschumbalkar, Aug 19 '21 16:08